Use of this document

This is a study note on R data types. Some additional information follows:

Prerequisites

Basic data structure manipulation

| Data structure | Create | Indexing | Coerce |
|---|---|---|---|
| Vector | c() | [1:5] | as.vector() |
| Matrix | matrix( , nrow = , ncol = ) | [1, 1] | as.matrix() |
| Array | array( , dim = c()) | [ , 2, , ] | as.array() |
| List | list( , , , ) | $ID, [["ID"]] or [[1]], ["ID"] or [1] | as.list() |
| Data frame | data.frame() | $ID, [["ID"]], [1, 1] | as.data.frame() |

# essential
library(tidyverse)

1. Attributes

All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named vector or list (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes(). Most attributes are lost when a vector is modified; the three most important ones, which are the ones most often kept, are names, dimensions, and class (sections 1.1 to 1.3).

Some other attributes

Attribute-related functions for vector, matrix, array

| Function | Vector | Matrix | Array |
|---|---|---|---|
| get names | names() | rownames(), colnames() | dimnames() |
| get length | length() | nrow(), ncol() | dim() |
| combine | c() | rbind(), cbind() | abind::abind() |
| transpose | - | t() | aperm() |
| check if type | is.null(dim(x)) | is.matrix() | is.array() |

Attribute-related functions for matrix, data frame

| Function | List | Matrix |
|---|---|---|
| get names | names() | rownames(), colnames() |
| get length | length() | nrow(), ncol() |
| combine | c() | rbind(), cbind() |
| transpose | - | t() |
| check if type | is.null(dim(x)) | is.matrix() |

1.1 Name

You can name a vector in three ways:

  • x <- c(a = , b = , c = ): when creating it.
    • x <- c(a = 1, b = 2, c = 3)
  • names(x) <- c("a", "b", "c") or names(x)[[1]] <- "a": by modifying an existing vector in place.
    • x <- 1:3; names(x) <- c("a", "b", "c")
    • x <- 1:3; names(x)[[1]] <- "a"
  • setNames(x, c("a", "b", "c")): by creating a modified copy of a vector.
    • x <- setNames(1:3, c("a", "b", "c"))
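The three naming approaches above can be compared side by side. This is a minimal sketch; the variable names x1 to x4 are chosen here for illustration only.

```r
# 1. Name the elements when creating the vector
x1 <- c(a = 1, b = 2, c = 3)

# 2. Modify an existing vector in place
x2 <- 1:3
names(x2) <- c("a", "b", "c")

# 3. Create a modified (named) copy
x3 <- setNames(1:3, c("a", "b", "c"))

names(x1)
names(x2)
names(x3)

# Naming only the first element leaves the remaining names as NA
x4 <- 1:3
names(x4)[[1]] <- "a"
names(x4)

# Once named, elements can be extracted by name
x1[["b"]]
```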

1.2 Dimensions

Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.
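Because dim is an ordinary attribute, it can be set either with attr() or with dim(). The sketch below illustrates this on toy data; the attribute name "note" is a made-up example of arbitrary metadata, not a name R treats specially.

```r
x <- 1:6
attributes(x)               # NULL: a plain vector carries no attributes

# Setting the "dim" attribute turns the vector into a 2 x 3 matrix
attr(x, "dim") <- c(2, 3)   # same effect as dim(x) <- c(2, 3)
is.matrix(x)
attributes(x)

# Arbitrary metadata can be attached in the same way
# ("note" is a hypothetical attribute name used only for illustration)
attr(x, "note") <- "toy example"
attr(x, "note")

# A dim of length three gives a 3-dimensional array
y <- 1:24
dim(y) <- c(2, 3, 4)
is.array(y)
dim(y)
```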

1.3 Class

Class is a property assigned to an object that determines how generic functions operate on it. It is not a mutually exclusive classification. If an object has no specific class assigned to it, such as a simple numeric vector, its class is, by convention, usually the same as its mode. Class is based on R’s object-oriented class hierarchy, shown below:

[Figure: R’s object-oriented class hierarchy]

Lowest-level data types

| Data type | Example | Verify: class() |
|---|---|---|
| Logical | TRUE, FALSE | v <- TRUE: logical |
| Numeric | 12.3, 5, 999 | v <- 23.5: numeric |
| Integer | 2L, 34L, 0L | v <- 2L: integer |
| Complex | 3 + 2i | v <- 2+5i: complex |
| Character | "a", "good", "TRUE", "23.4" | v <- "TRUE": character |
| Raw | "Hello" is stored as 48 65 6c 6c 6f | v <- charToRaw("Hello"): raw |

  • is.xxx(): checks whether the data type is xxx, for example is.logical() or is.character().
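The table can be checked directly in a session. The sketch below reuses the example values from the table; the variable names (v_lgl and so on) are illustrative only.

```r
v_lgl <- TRUE
v_dbl <- 23.5
v_int <- 2L
v_cpl <- 2 + 5i
v_chr <- "TRUE"
v_raw <- charToRaw("Hello")

# class() of each value, in the same order as the table rows
types <- sapply(list(v_lgl, v_dbl, v_int, v_cpl, v_chr, v_raw), class)
types

# The raw vector holds the bytes of "Hello"
v_raw

# is.xxx() functions test for a specific type
is.logical(v_lgl)     # TRUE
is.numeric(v_int)     # TRUE: integers also count as numeric
is.character(v_chr)   # TRUE: the quoted "TRUE" is text
is.logical(v_chr)     # FALSE
```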

2. Basic data structure

Basic data structures by dimension and type

| | One type structure | Multiple types structure |
|---|---|---|
| 1-Dimension | (Atomic) Vector | List |
| 2-Dimension | Matrix | Data frame |
| n-Dimension | Array | |

2.1 Vectors

Vector-type data support vectorized operations, which apply one operation to multiple pairs of operands at once.

  • Atomic vectors are usually created with c(), short for combine.

  • Atomic vectors are always flat, even if you nest c()’s

  • anyNA(): returns TRUE if the vector contains any missing values.

  • is.na(): indicates the elements of the vectors that represent missing data.
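The points above can be illustrated with a few toy vectors. This is a minimal sketch; the values are arbitrary.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Vectorized arithmetic: one call operates on every pair of elements
x + y     # 11 22 33
x * 2     # 2 4 6 (the scalar is recycled)

# Nested c() calls are flattened into a single atomic vector
z <- c(1, c(2, c(3, 4)))
length(z) # 4

# Missing values
m <- c(1, NA, 3)
anyNA(m)  # TRUE
is.na(m)  # FALSE TRUE FALSE
```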

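a <- c(1:3) # integer vector
b <- c(FALSE, TRUE, FALSE) # logical vector
c <- c("one", "two", "three") # character vector

d <- seq(1:3) # integer vector
d <- seq(from = 1, to = 30, by = 10) # overwrites d with a double vector: 1 11 21
df <- data.frame(a = a, b = b, c = c, d = d)
# The output below comes from R < 4.0.0, where data.frame() converted strings to factors by default
str(df)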
## 'data.frame':    3 obs. of  4 variables:
##  $ a: int  1 2 3
##  $ b: logi  FALSE TRUE FALSE
##  $ c: Factor w/ 3 levels "one","three",..: 1 3 2
##  $ d: num  1 11 21
sapply(df, class)
##         a         b         c         d 
## "integer" "logical"  "factor" "numeric"
sapply(df, typeof)
##         a         b         c         d 
## "integer" "logical" "integer"  "double"
sapply(df, mode)
##         a         b         c         d 
## "numeric" "logical" "numeric" "numeric"

2.1.1 Matrices and Arrays

Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.

Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim():

# create
## Two scalar arguments specify row and column sizes
a <- matrix(1:6, nrow = 2, ncol = 3)
## One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
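The attribute-related functions listed in section 1 can be tried on objects like these. The sketch below rebuilds a small matrix and array under its own names (m and arr) so that it runs on its own.

```r
m   <- matrix(1:6, nrow = 2, ncol = 3)
arr <- array(1:12, c(2, 3, 2))

# Size and shape
dim(m)        # 2 3
nrow(m)
ncol(m)

# Values are filled column by column, so row 1, column 2 holds 3
m[1, 2]

# Transpose a matrix; permute the dimensions of an array
dim(t(m))                     # 3 2
dim(aperm(arr, c(2, 1, 3)))   # 3 2 2

# Combine by rows
m2 <- rbind(m, 7:9)
dim(m2)                       # 3 3

# Leaving an index empty selects everything along that dimension
arr[, 2, ]                    # second column of every slice
```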

2.2 Lists

Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors. You construct lists by using list() instead of c():

  • lists allow the most flexibility
  • a list element can be anything
  • very useful (and common) as output from analytic methods
  • lists can also be variables in a dataset

A list is built on top of a vector, whereas data frames and tibbles are built on top of lists. Compared with a list, therefore, a vector is lower-level, and data frames and tibbles are higher-level. The following is an example of the data object architecture:

[Figure: data object architecture]
mod <- glm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
mode(mod)
## [1] "list"
a <- list(1:10) # list holding a single integer vector
b <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(b)
## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9
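The list indexing forms from the table at the top of this note ($ID, [["ID"]], ["ID"]) behave differently, which a small named list makes visible. This is a sketch; the element names ID, label, and flag are made up for illustration.

```r
lst <- list(ID = 1:3, label = "a", flag = c(TRUE, FALSE, TRUE))

# $ and [[ ]] extract the element itself
lst$ID
lst[["ID"]]
lst[[1]]

# [ ] returns a shorter list that still wraps the element
lst["ID"]
class(lst[["ID"]])   # "integer"
class(lst["ID"])     # "list"

# c() coerces mixed input to one type, list() keeps each type
class(c(1, "a", TRUE))              # "character"
sapply(list(1, "a", TRUE), class)   # "numeric" "character" "logical"
```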

2.2.1 Data Frames

A data frame is a named list of vectors with attributes for (column) names, row.names, and its class, data.frame.

  • A data frame has rownames() and colnames(). The names() of a data frame are the column names.
  • A data frame has nrow() rows and ncol() columns. The length() of a data frame gives the number of columns.
df1 <- data.frame(x = 1:3, y = letters[1:3])
print(df1)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
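The list-like and matrix-like sides of a data frame can be inspected on the same object. The sketch below recreates df1 so that it runs on its own.

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])

# Underneath, a data frame is a list with a class attribute
typeof(df1)             # "list"
class(df1)              # "data.frame"
names(attributes(df1))  # names, class, row.names (in some order)

# List-like behaviour: names() and length() refer to columns
names(df1)              # "x" "y"
length(df1)             # 2

# Matrix-like behaviour: row and column counts, two-index subsetting
nrow(df1)               # 3
ncol(df1)               # 2
df1[1, 1]

# A column can be extracted like a list element
df1$x
df1[["x"]]
```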

3. S3 atomic vectors

One of the most important vector attributes is class, which underlies the S3 object system. Having a class attribute turns an object into an S3 object, which means it will behave differently from a regular vector when passed to a generic function. Every S3 object is built on top of a base type, and often stores additional information in other attributes. In this section, we’ll discuss four important S3 vectors used in base R: factors, dates, date-times, and durations.

The following is the schema of the S3 object system.

[Figure: S3 object system]

3.1 Factors

A factor is a vector that can contain only predefined values, and is used to store categorical data.

  • before R 4.0.0, most data loading functions in R (for example read.csv() and data.frame()) automatically converted character vectors to factors by default
    • use the argument stringsAsFactors = FALSE to suppress this behaviour in those versions; since R 4.0.0 it is the default
  • some built-in character vectors can be converted to factors with a specific level order
    • month.name
    • month.abb
    • state.name
    • state.abb
# creating
a <- factor(c("a", "b", "b", "a"))
levels(a)
## [1] "a" "b"
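A built-in vector such as month.abb can supply the level order, as mentioned above. The sketch below uses a made-up vector of month abbreviations to contrast the default (sorted) levels with calendar-ordered levels.

```r
m <- c("Mar", "Jan", "Feb", "Jan")

# Default: levels are the sorted unique values
f_default <- factor(m)
levels(f_default)     # "Feb" "Jan" "Mar"

# Explicit levels from month.abb give calendar order
f_cal <- factor(m, levels = month.abb)
levels(f_cal)         # all twelve abbreviations, Jan to Dec
sort(f_cal)           # Jan Jan Feb Mar

# A factor is stored as integer codes that point into its levels
typeof(f_cal)         # "integer"
as.integer(f_cal)     # 3 1 2 1

# Levels with no observations are still counted (as zero)
counts <- table(factor(c("a", "b", "b", "a"), levels = c("a", "b", "c")))
counts
```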

3.2 Date

(see the note on the lubridate package)

3.3 Date-time

(see the note on the lubridate package)

3.4 Durations

(see the note on the lubridate package)