This is a study note on tibbles, data import, parsing vectors and files, writing to a file, and tidy data, which are parts of data wrangling. For more details on the study material see https://r4ds.had.co.nz/wrangle-intro.html.
library(dplyr) # data manipulation
library(tibble) # tibble form
library(readr) # read data
library(tidyr) # tidy messy data
The \(tibble\) package is a part of the core \(tidyverse\).
as_tibble()
: coerce a data frame to a tibble.
tibble()
: create a new tibble from individual vectors or attributes.
class()
: check the type of the data.
print(n = 10, width = Inf)
: print the first 10 rows and all columns.
[[
can extract by name or position; $
only extracts by name.
The \(readr\) package is also part of the core \(tidyverse\).
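A minimal sketch of these tibble helpers (the column values are made up for illustration):

```r
library(tibble)

# Build a tibble from vectors; tibbles never convert strings to factors
tb <- tibble(x = 1:3, fruit = c("apple", "pear", "plum"))
class(tb)        # "tbl_df" "tbl" "data.frame"

# Coerce an existing data frame to a tibble
as_tibble(mtcars)

# $ extracts a column by name; [[ extracts by name or position
tb$fruit
tb[["fruit"]]
tb[[2]]
```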
read_csv()
: reads comma delimited files.
Use skip = n
to skip the first n
lines, or comment = "#"
to drop all lines that start with (e.g.) #
; e.g. read_csv(x, skip = 2)
. Use col_names = FALSE
to tell read_csv()
not to treat the first row as headings; e.g. read_csv(x, col_names = FALSE)
read_csv2()
: reads semicolon separated files (common in countries where , is used as the decimal place),
read_tsv()
reads tab delimited files, and read_delim() reads in files with any delimiter.
read_fwf()
reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions().
read_table()
reads a common variation of fixed width files where columns are separated by white space.
read_log()
reads Apache style log files. (But also check out webreadr which is built on top of read_log() and provides many more helpful tools.)
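A small sketch of these readers on inline text — readr treats a string containing a newline as literal data; the column names and field widths below are assumptions for this toy input:

```r
library(readr)

# Skip metadata lines, then read the header and data
read_csv("junk line 1\njunk line 2\nx,y\n1,2", skip = 2)

# Headerless data: supply the column names ourselves
read_csv("1,2,3\n4,5,6", col_names = c("a", "b", "c"))

# Fixed-width fields: widths 6 and 2, with assumed names
read_fwf("apple 12\ngrape 34\n", fwf_widths(c(6, 2), c("fruit", "n")))
```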
problems(x)
/ stop_for_problems(x)
: access the errors when parsing fails / throw an error if there were any parsing problems.
For more information, check out the Note-readFile note in the Note folder.
read_csv()
vs. read.csv()
: the readr functions are typically much faster, produce tibbles, don’t convert character vectors to factors, and are more reproducible (base R functions inherit some behaviour from your operating system and environment variables).
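A quick sketch of one difference between the two:

```r
library(readr)

# Base R returns a plain data.frame
df_base <- read.csv(text = "x,y\n1,2")
class(df_base)        # "data.frame"

# readr returns a tibble
df_readr <- read_csv("x,y\n1,2")
class(df_readr)       # includes "tbl_df"
```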
The parse_*()
functions take a character vector and return a more specialised vector such as a logical, integer, or date. Typical uses are:
parse_logical(c("TRUE", "FALSE", "NA")) %>% str()
## logi [1:3] TRUE FALSE NA
parse_integer(c("1", "2", "3")) %>% str()
## int [1:3] 1 2 3
parse_date(c("2010-01-01", "1979-10-14")) %>% str()
## Date[1:2], format: "2010-01-01" "1979-10-14"
parse_integer(c("123", "345", "abc", "123.45")) %>% problems()
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
Using parsers is mostly a matter of understanding what’s available and how they deal with different types of input. There are eight particularly important parsers:
parse_double()
is a strict numeric parser, and parse_number()
is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.
Issue 1: People write numbers differently in different parts of the world. For example, some countries use . between the integer and fractional parts of a real number, while others use , instead. Solution:
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
Issue 2: Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”. Solution:
parse_number("It cost $123.45")
## [1] 123.45
Issue 3: Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world. Solution:
parse_number("$123,456,789") # default is American grouping mark
## [1] 123456789
parse_number("123.456.789", locale = locale(grouping_mark = ".")) # Used in many parts of Europe
## [1] 123456789
parse_number("123'456'789", locale = locale(grouping_mark = "'")) # Used in Switzerland
## [1] 123456789
parse_character()
seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings. \(readr\) uses UTF-8 encoding as the default for reading and writing. If the data is not UTF-8 encoded, specify the correct encoding in the locale:
x1 <- "El Ni\xf1o was particularly bad this year"
guess_encoding(charToRaw(x1)) # Latin1 (aka ISO-8859-1)
## # A tibble: 2 x 2
## encoding confidence
## <chr> <dbl>
## 1 ISO-8859-1 0.46
## 2 ISO-8859-9 0.23
x1 %>% parse_character(locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_factor()
creates factors, the data structure that R uses to represent categorical variables with fixed and known values. But if you have many problematic entries, it’s often easier to leave them as character vectors and then use the tools you’ll learn about in strings and factors to clean them up.
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## [1] apple banana <NA>
## attr(,"problems")
## # A tibble: 1 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA value in level set bananana
## Levels: apple banana
parse_datetime()
, parse_date()
, and parse_time()
allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.
parse_datetime()
expects an ISO8601 date-time, which is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
parse_datetime("20101010")
## [1] "2010-10-10 UTC"
parse_date()
expects a four digit year, a - or /, the month, a - or /, then the day:
parse_date("2010-10-01")
## [1] "2010-10-01"
parse_time()
expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:
library(hms)
parse_time("01:10 am")
## 01:10:00
parse_time("20:10:01")
## 20:10:01
parse_logical()
and parse_integer()
parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.
Parsing a file means quickly identifying and assigning a type to each column (attribute) in the input data set.
\(readr\) uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess the type. You can emulate this process with a character vector using guess_parser()
, which returns readr’s best guess, and parse_guess()
which uses that guess to parse the column:
guess_parser("2010-10-01")
## [1] "date"
str(parse_guess("2010-10-10"))
## Date[1:1], format: "2010-10-10"
The heuristic tries each of the following types, stopping when it finds a match: logical, integer, double, number, time, date, date-time.
If none of these rules apply, then the column will stay as a vector of strings.
If you encounter parsing problems, check the link for more details: https://r4ds.had.co.nz/data-import.html. If you are reading a very large file, you might want to set n_max to a smallish number like 10,000 or 100,000. That will accelerate your iterations while you eliminate common problems.
challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_date(format = "")
## )
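Rather than relying on the guess, you can supply the column specification yourself; readr prints the guessed spec, which you can copy into col_types and tweak:

```r
library(readr)

# Explicit column types make the import reproducible
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
```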
\(readr\) also comes with two useful functions for writing data back to disk:
write_csv()
and write_tsv()
: write data back to disk. They always encode strings in UTF-8 and save dates and date-times in ISO8601 format, which increases the chances of the output file being read back in correctly.
write_excel_csv()
: writes a special character (a “byte order mark”) at the start of the file, which tells Excel that you’re using the UTF-8 encoding. Note that the column type information is lost when you save to csv. The feather package implements a fast binary file format that can be shared across programming languages:
library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
## # A tibble: 2,000 x 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
## 7 2489 NA
## 8 1449 NA
## 9 3665 NA
## 10 3863 NA
## # … with 1,990 more rows
To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:
\(haven\) reads SPSS, Stata, and SAS files.
\(readxl\) reads Excel files (both .xls and .xlsx).
\(DBI\), along with a database specific backend (e.g. \(RMySQL\), \(RSQLite\), \(RPostgreSQL\) etc) allows you to run SQL queries against a database and return a data frame.
For hierarchical data: use \(jsonlite\) (by Jeroen Ooms) for json, and \(xml2\) for XML. Jenny Bryan has some excellent worked examples at https://jennybc.github.io/purrr-tutorial/.
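As a hedged sketch of the DBI workflow (assuming the RSQLite backend is installed; the table name is arbitrary):

```r
library(DBI)

# Connect to an in-memory SQLite database via the RSQLite backend
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Run an SQL query and get a data frame back
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")

dbDisconnect(con)
```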
Tidy data is a consistent way to organise your data in R. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand. There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
These three rules are interrelated because it’s impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:
1. Put each dataset in a tibble.
2. Put each variable in a column.
\(dplyr\), \(ggplot2\), and all the other packages in the \(tidyverse\) are designed to work with tidy data.
To resolve one of two common problems: one variable may be spread across multiple columns, or one observation may be scattered across multiple rows.
A common problem is a dataset where some of the column names are not names of variables, but values of a variable. gather()
gathers those columns into a new pair of variables.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
spread()
is the opposite of gathering. You use it when an observation is scattered across multiple rows.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>%
spread(key = type, value = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
separate()
pulls apart one column into multiple columns, by splitting wherever a separator character appears. By default it splits wherever it sees a non-alphanumeric character; specify a particular character with the sep
argument of separate()
. You can also ask separate()
to try to convert the new columns to better types using convert = TRUE
.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table3 %>%
separate(rate, into = c("cases", "population"), sep = ":") # ":" never appears in rate, so the whole value lands in cases and population is NA
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745/19987071 <NA>
## 2 Afghanistan 2000 2666/20595360 <NA>
## 3 Brazil 1999 37737/172006362 <NA>
## 4 Brazil 2000 80488/174504898 <NA>
## 5 China 1999 212258/1272915272 <NA>
## 6 China 2000 213766/1280428583 <NA>
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
unite()
is the inverse of separate()
: it combines multiple columns into a single column.
By default, unite()
will place an underscore (_
) between the values from different columns; override this with the sep
argument.
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways: explicitly, i.e. flagged with NA; or implicitly, i.e. simply not present in the data.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
Make the implicit missing value explicit by putting years in the columns:
stocks
## # A tibble: 7 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66
stocks %>%
spread(year, return)
## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66
Set na.rm = TRUE in gather() to turn explicit missing values implicit:
stocks %>%
spread(year, return) %>%
gather(year, return, `2015`:`2016`, na.rm = TRUE)
## # A tibble: 6 x 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 3 2015 0.35
## 4 2 2016 0.92
## 5 3 2016 0.17
## 6 4 2016 2.66
complete()
takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.
stocks %>%
complete(year, qtr)
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
fill()
takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
treatment
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 Katherine Burke 1 4
treatment %>%
fill(person)
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4