Use of this document

This is a study note on R data types. Some additional references:

Comparing object-oriented programming (OO) with procedure-oriented programming (PO) in terms of message passing: http://blog.fens.me/r-object-oriented-intro/:
- Encapsulation
- Inheritance
- Polymorphism
- Abstraction
adv-r: S3 vs R6, S4 vs RC objects: https://adv-r.hadley.nz/oo-tradeoffs.html:
- Overall, when picking an OO system, I recommend that you default to S3. S3 is simple, and widely used throughout base R and CRAN.
- S3 is R’s first and simplest OO system. S3 is informal and ad hoc, but there is a certain elegance in its minimalism.
- S4 is more formal and tends to require more upfront planning. That makes it more suitable for big projects developed by teams, not individuals.
- R6 is a profoundly different OO system from S3 and S4 because it is built on encapsulated objects, rather than generic functions.
pkg `pryr`: Useful tools to pry back the covers of R and understand the language at a deeper level. http://blog.fens.me/r-pryr/.

Prerequisites

Basic data structure manipulation
Data structure	Create	Indexing	Coerce
Vector	`c()`	`[1:5]`	`as.vector()`
Matrix	`matrix( , nrow = , ncol = )`	`[1,1]`	`as.matrix()`
Array	`array( , dim=c())`	`[ , 2, , ]`	`as.array()`
List	`list( , , , ,)`	`$ID`, `[["ID"]]` or `[[1]]`, `["ID"]` or `[1]`	`as.list()`
Data frame	`data.frame()`	`$ID`, `[["ID"]]`, `[1, 1]`	`as.data.frame()`

# essential
library(tidyverse)

1. Attributes

All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named vector or list (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes(). Most attributes are lost when a vector is modified; the ones that generally survive are the three most important:

names(x): Names / dimnames, a character vector giving each element a name.
dim(x): Dimensions, used to turn vectors into matrices and arrays, described in matrices and arrays.
class(x): Class, used to implement the S3 object system, described in S3.
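As a minimal sketch of attr() and attributes(), the following sets and reads attributes on a small vector. The attribute name `note` and its value are made up for illustration.

```r
x <- 1:6
attr(x, "note") <- "six integers"   # set one (hypothetical) attribute
attr(x, "note")                     # read it back individually
dim(x) <- c(2, 3)                   # dim is stored as an attribute too
names(attributes(x))                # all attributes at once, as a named list

y <- sum(x)                         # sum() returns a plain value
attributes(y)                       # NULL: no attributes carried over
```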

Some other attributes

metadata(): get or set the metadata of a Raster object (from the raster package).

Attribute-related functions for `vector`, `matrix`, `array`
function description	Vector	Matrix	Array
get names	`names()`	`rownames()`, `colnames()`	`dimnames()`
get length	`length()`	`nrow()`, `ncol()`	`dim()`
combine	`c()`	`rbind()`, `cbind()`	`abind::abind()`
transpose	-	`t()`	`aperm()`
check if type	`is.null(dim(x))`	`is.matrix()`	`is.array()`
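A short sketch of a few functions from the table, using base R only (so `abind::abind()` is left out). The object names and dimnames are illustrative.

```r
# A 2 x 3 matrix with row and column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
rownames(m)            # "r1" "r2"
colnames(m)            # "a" "b" "c"
dim(rbind(m, m))       # 4 3: rows stacked
dim(t(m))              # 3 2: transposed

# A 2 x 3 x 4 array; aperm() reverses the dimensions by default
a <- array(1:24, dim = c(2, 3, 4))
dim(aperm(a))          # 4 3 2

# Type checks
is.matrix(m)           # TRUE
is.array(a)            # TRUE
is.null(dim(1:3))      # TRUE: a plain vector has no dim attribute
```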

Attribute-related functions for `List`, `Matrix`
function description	List	Matrix
get names	`names()`	`rownames()`, `colnames()`
get length	`length()`	`nrow()`, `ncol()`
combine	`c()`	`rbind()`, `cbind()`
transpose	-	`t()`
check if type	`is.null(dim(x))`	`is.matrix()`

1.1 Name

You can name a vector in three ways:

x <- c(a = , b = , c = ): When creating it.
- x <- c(a = 1, b = 2, c = 3).
names(x) <- c("a", "b", "c") or names(x)[[1]]: By modifying an existing vector in place.
- x <- 1:3; names(x) <- c("a", "b", "c")
- x <- 1:3; names(x)[[1]] <- c("a").
setNames(x, c("a", "b", "c")): By creating a modified copy of a vector.
- x <- setNames(1:3, c("a", "b", "c")).
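The three approaches can be compared side by side. This is a small sketch with illustrative variable names (`x1` to `x4`).

```r
x1 <- c(a = 1, b = 2, c = 3)             # 1. name when creating
x2 <- 1:3
names(x2) <- c("a", "b", "c")            # 2. modify an existing vector in place
x3 <- setNames(1:3, c("a", "b", "c"))    # 3. create a modified copy

names(x1)                                # "a" "b" "c"
identical(x2, x3)                        # TRUE: same values, same names

x4 <- 1:3
names(x4)[[1]] <- "a"                    # name only the first element
names(x4)                                # the other names are missing (NA)
```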

1.2 Dimensions

Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.

1.3 Class

Class is a property assigned to an object that determines how generic functions operate on it. It is not a mutually exclusive classification. If an object has no specific class assigned to it, such as a simple numeric vector, its class is by convention usually the same as its mode. Class is based on R’s object-oriented class hierarchy; the lowest-level data types are shown below:

lowest-level data type
Data Type	Example	Verify: class()
Logical	TRUE, FALSE	`v <- TRUE`: `logical`
Numeric	12.3, 5, 999	`v <- 23.5`: `numeric`
Integer	2L, 34L, 0L	`v <- 2L`: `integer`
Complex	3 + 2i	`v <- 2+5i`: `complex`
Character	"a", "good", "TRUE", "23.4"	`v <- "TRUE"`: `character`
Raw	"Hello" is stored as 48 65 6c 6c 6f	`v <- charToRaw("Hello")`: `raw`

is.xxx(): checks whether the data type is xxx (e.g. is.numeric(), is.character()).
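The table can be verified in one pass. The sketch below collects one value of each type in a list (the name `v` is illustrative) and applies class() and a few is.xxx() checks.

```r
# One value per low-level data type
v <- list(TRUE, 23.5, 2L, 2+5i, "TRUE", charToRaw("Hello"))
sapply(v, class)
# "logical" "numeric" "integer" "complex" "character" "raw"

is.numeric(2L)         # TRUE: integers count as numeric
is.integer(5)          # FALSE: 5 without the L suffix is a double
is.character("TRUE")   # TRUE: quoted, so a string rather than a logical
is.logical("TRUE")     # FALSE
```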

2. Basic data structure

Basic data structures by dimension and type
	One type structure	Multiple types structure
1-Dimension	(Atomic) Vector	List
2-Dimension	Matrix	Data frame
n-Dimension	Array

class(): an object’s object-oriented classification according to the R class hierarchy (high-level, e.g. data.frame).
typeof(): the (R internal) type or storage mode of any object (low-level, e.g. list).
mode(): a mutually exclusive classification of objects according to their basic structure, independent of their class (their position in the class hierarchy). The ‘atomic’ modes are numeric, complex, character and logical. Recursive objects have modes such as ‘list’ or ‘function’ or a few others. An object has one and only one mode.
length(): how long is it? For a matrix this is the number of elements; for a data frame it is the number of columns.
attributes(): does it have any metadata?
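To see how these functions differ, the sketch below applies them to a whole data frame and a matrix (rather than to individual columns). The object names are illustrative.

```r
df <- data.frame(x = 1:3)
class(df)     # "data.frame": high-level class
typeof(df)    # "list": internal storage type
mode(df)      # "list"
length(df)    # 1: the number of columns

m <- matrix(1:6, nrow = 2)
class(m)      # includes "matrix" (and also "array" from R 4.0.0 on)
typeof(m)     # "integer"
mode(m)       # "numeric"
length(m)     # 6: the number of elements
```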

2.1 Vectors

Vectors support vectorized operations, which apply one operation to multiple pairs of operands at once.

Atomic vectors are usually created with c(), short for combine.
Atomic vectors are always flat, even if you nest c()’s.
anyNA(): returns TRUE if the vector contains any missing values.
is.na(): indicates the elements of the vectors that represent missing data.
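A brief sketch of the points above: vectorized arithmetic, the flattening behaviour of c(), and the two missing-value helpers. The variable names are illustrative.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)
x + y                              # 11 22 33: one operation, applied pairwise

nested <- c(1, c(2, c(3, 4)))      # nested c() calls are flattened
identical(nested, c(1, 2, 3, 4))   # TRUE

z <- c(1, NA, 3)
anyNA(z)                           # TRUE: at least one missing value
is.na(z)                           # FALSE TRUE FALSE: position of the missing value
```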

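a <- c(1:3) # integer vector
b <- c(FALSE,TRUE,FALSE) # logical vector
c <- c("one","two","three") # character vector

d <- seq(1:3) # integer vector (overwritten on the next line)
d <- seq(from = 1, to = 30, by = 10) # numeric (double) vector: 1 11 21
# stringsAsFactors = TRUE reproduces the factor column shown below (the default before R 4.0.0)
df <- data.frame(a=a,b=b,c=c,d=d, stringsAsFactors = TRUE)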
str(df)

## 'data.frame':    3 obs. of  4 variables:
##  $ a: int  1 2 3
##  $ b: logi  FALSE TRUE FALSE
##  $ c: Factor w/ 3 levels "one","three",..: 1 3 2
##  $ d: num  1 11 21

sapply(df, class)

##         a         b         c         d 
## "integer" "logical"  "factor" "numeric"

sapply(df, typeof)

##         a         b         c         d 
## "integer" "logical" "integer"  "double"

sapply(df, mode)

##         a         b         c         d 
## "numeric" "logical" "numeric" "numeric"

2.1.1 Matrices and Arrays

Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.

Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim():

# create
## Two scalar arguments specify row and column sizes
a <- matrix(1:6, nrow = 2, ncol = 3)
## One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)

2.2 Lists

Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors. You construct lists by using list() instead of c():

lists allow the most flexibility
a list element can be anything
very useful (and common) as output from analytic methods
lists can also be variables in a dataset

Lists are built on top of vectors, whereas data frames and tibbles are built on top of lists. Compared to a list, therefore, a vector is lower-level, and data frames and tibbles are higher-level. The following is an example of the data object architecture:

mod<-glm(mpg~cyl+disp+hp+drat+wt, data=mtcars)
mode(mod)

## [1] "list"

a <- list(1:10) # list with one element: an integer vector
b <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(b)

## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9
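The layering described above (vector, then list, then data frame) can be checked with base R predicates. This is a small sketch; the object names are illustrative.

```r
l <- list(1:3, "a")
is.vector(l)     # TRUE: a list is itself a (generic) vector
is.atomic(l)     # FALSE: unlike atomic vectors, it can mix types
l[[1]]           # [[ ]] extracts the element itself: 1 2 3
class(l[1])      # "list": [ ] returns a sub-list

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(df)      # TRUE: a data frame is a list of columns
df[["x"]]        # list-style indexing works on a data frame too
```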

2.2.1 Data Frames

A data frame is a named list of vectors with attributes for (column) names, row.names, and its class, data.frame.

A data frame has rownames() and colnames(). The names() of a data frame are the column names.
A data frame has nrow() rows and ncol() columns. The length() of a data frame gives the number of columns.

df1 <- data.frame(x = 1:3, y = letters[1:3])
print(df1)

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
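Continuing with the same kind of data frame, the sketch below checks the statements about names, length, and attributes.

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])

names(df1)                # "x" "y": names() gives the column names
colnames(df1)             # same as names(df1)
rownames(df1)             # "1" "2" "3"
length(df1)               # 2: the number of columns
nrow(df1)                 # 3
ncol(df1)                 # 2
names(attributes(df1))    # the attributes are names, class and row.names
```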

3. S3 atomic vectors

One of the most important vector attributes is class, which underlies the S3 object system. Having a class attribute turns an object into an S3 object, which means it will behave differently from a regular vector when passed to a generic function. Every S3 object is built on top of a base type, and often stores additional information in other attributes. In this section, we’ll discuss four important S3 vectors used in base R:

Categorical data, where values come from a fixed set of levels recorded in factor vectors.
Dates (with day resolution), which are recorded in Date vectors.
Date-times (with second or sub-second resolution), which are stored in POSIXct vectors.
Durations, which are stored in difftime vectors.
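Each of these four S3 vectors is a base type plus a class attribute, which the sketch below makes visible with typeof() and class(). The example values are arbitrary.

```r
f <- factor(c("a", "b", "b", "a"))
typeof(f)       # "integer": factors are integer codes plus levels
class(f)        # "factor"

d <- as.Date("2021-01-21")
typeof(d)       # "double"
unclass(d)      # 18648: days since 1970-01-01

dt <- as.POSIXct("2021-01-21 12:00:00", tz = "UTC")
typeof(dt)      # "double": seconds since 1970-01-01 00:00:00 UTC
class(dt)       # "POSIXct" "POSIXt"

dur <- d - as.Date("2021-01-01")
class(dur)      # "difftime"
units(dur)      # "days"
as.numeric(dur) # 20
```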

(Figure: schema of the S3 object system.)

3.1 Factors

A factor is a vector that can contain only predefined values, and is used to store categorical data.

before R 4.0.0, most data loading functions in R automatically converted character vectors to factors, unfortunately
- use the argument stringsAsFactors = FALSE to suppress this behaviour (it is the default since R 4.0.0),
some built-in character vectors can be used as factor levels to give a specific order
- month.name
- month.abb
- state.name
- state.abb

# creating
a <- factor(c("a", "b", "b", "a"))
levels(a)

## [1] "a" "b"
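A sketch of the two points above: using a built-in character vector (here `month.abb`) as the levels fixes the order, and `stringsAsFactors` controls whether strings become factors. The variable names are illustrative.

```r
# Levels taken from month.abb, so values sort in calendar order
m <- factor(c("Mar", "Jan", "Feb"), levels = month.abb)
levels(m)[1:3]            # "Jan" "Feb" "Mar"
as.integer(m)             # 3 1 2: position of each value in the levels
as.character(sort(m))     # "Jan" "Feb" "Mar", not alphabetical

# Explicitly keep strings as character columns
df <- data.frame(c = c("one", "two"), stringsAsFactors = FALSE)
class(df$c)               # "character"

# Explicitly convert strings to factors
df2 <- data.frame(c = c("one", "two"), stringsAsFactors = TRUE)
class(df2$c)              # "factor"
```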

3.2 Date

(check the note for the `lubridate` package)

3.3 Date-time

(check the note for the `lubridate` package)

3.4 Durations

(check the note for the `lubridate` package)

Note-Data_type

Weiquan Luo

Updated by 2021-01-21
