Frequently Used R Data Structures in Machine Learning

Packt Publishing

5.00/5 (2 votes)

Nov 5, 2019

CPOL

13 min read

8999

How to solve real-world problems with R and Machine learning

There are numerous types of data structures across programming languages, each with strengths and weaknesses suited to specific tasks. Since R is a programming language used widely for statistical data analysis, the data structures it utilizes were designed with this type of work in mind.

The R data structures used most frequently in machine learning are vectors, factors, lists, arrays, matrices, and data frames. Each is tailored to a specific data management task, which makes it important to understand how they will interact in your R project. In this article, we will go through different types of R data structures and understand the similarities and differences between them.

This post is taken from the book Machine Learning with R - Third Edition, by Packt Publishing and written by Brett Lantz. This book will help you solve real-world problems with R and Machine learning.

Vectors

The fundamental R data structure is the vector, which stores an ordered set of values called elements. A vector can contain any number of elements. However, all of its elements must be of the same type; for instance, a vector cannot contain both numbers and text. To determine the type of vector v, use the typeof(v) command.

Several vector types are commonly used in machine learning: integer (numbers without decimals), double (numbers with decimals), character (text data), and logical (TRUE or FALSE values). There are also two special values: NA, which indicates a missing value, and NULL, which is used to indicate the absence of any value. Although these two may seem to be synonymous, they are indeed slightly different. The NA value is a placeholder for something else and therefore has a length of one, while the NULL value is truly empty and has a length of zero.

It is tedious to enter large amounts of data by hand, but simple vectors can be created by using the c() combine function. The vector can also be given a name using the arrow <- operator. This is R's assignment operator, used much like the = assignment operator in many other programming languages.

For example, let's construct a set of vectors containing data on three medical patients. We'll create a character vector named subject_name to store the three patient names, a numeric vector named temperature to store each patient's body temperature in degrees Fahrenheit, and a logical vector named flu_status to store each patient's diagnosis (TRUE if he or she has influenza, FALSE otherwise). As shown in the following code, the three vectors are:

> subject_name <- c("John Doe", "Jane Doe", "Steve Graves")> temperature 
<- c(98.1, 98.6, 101.4)> flu_status <- c(FALSE, FALSE, TRUE)

Values stored in R vectors retain their order. Therefore, data for each patient can be accessed using his or her position in the set, beginning at 1, then supplying this number inside square brackets (that is, [ and ]) following the name of the vector. For instance, to obtain the temperature value for patient Jane Doe, the second patient, simply type:

> temperature[2][1] 98.6

R offers a variety of methods to extract data from vectors. A range of values can be obtained using the colon operator. For instance, to obtain the body temperature of the second and third patients, type:

> temperature[2:3][1] 98.6 101.4

Items can be excluded by specifying a negative item number. To exclude the second patient's temperature data, type:

> temperature[-2][1]  98.1 101.4

It is also sometimes useful to specify a logical vector indicating whether or not each item should be included. For example, to include the first two temperature readings but exclude the third, type:

> temperature[c(TRUE, TRUE, FALSE)][1] 98.1 98.6

As you will see shortly, the vector provides the foundation for many other R data structures. Therefore, knowing the various vector operations is crucial for working with data in R.

For the most up-to-date R code, as well as issue tracking and a public wiki, please join the community!

Factors

If you recall from Chapter 1, Introducing Machine Learning, nominal features represent a characteristic with categories of values. Although it is possible to use a character vector to store nominal data, R provides a data structure specifically for this purpose. A factor is a special case of vector that is solely used for representing categorical or ordinal variables. In the medical dataset we are building, we might use a factor to represent gender because it uses two categories: male and female.

Why use factors rather than character vectors? One advantage of factors is that the category labels are stored only once. Rather than storing MALE, MALE, FEMALE, the computer may store 1, 1, 2, which can reduce the memory needed to store the values. Additionally, many machine learning algorithms treat nominal and numeric features differently. Coding categorical variables as factors ensures that R will handle categorical data appropriately.

To create a factor from a character vector, simply apply the factor() function. For example:

> gender <- factor(c("MALE", "FEMALE", "MALE"))> gender[1] MALE   FEMALE MALELevels: FEMALE MALE

Notice that when the gender factor was displayed, R printed additional information about its levels. The levels comprise the set of possible categories the factor could take, in this case, MALE or FEMALE.

When we create factors, we can add additional levels that may not appear in the original data. Suppose we created another factor for blood type, as shown in the following example:

> blood <- factor(c("O", "AB", "A"), 
           levels = c("A", "B", "AB", "O"))> blood[1] O  AB ALevels: A B AB O

When we defined the blood factor, we specified an additional vector of four possible blood types using the levels parameter. As a result, even though our data includes only blood types O, AB, and A, all four types are retained with the blood factor, as the output shows. Storing the additional level allows for the possibility of adding patients with the other blood type in the future. It also ensures that if we were to create a table of blood types, we would know that type B exists, despite it not being found in our initial data.

The factor data structure also allows us to include information about the order of a nominal variable's categories, which provides a method for creating ordinal features. For example, suppose we have data on the severity of patient symptoms, coded in increasing order of severity from mild, to moderate, to severe. We indicate the presence of ordinal data by providing the factor's levels in the desired order, listed ascending from lowest to highest, and setting the ordered parameter to TRUE as shown:

> symptoms <- factor(c("SEVERE", "MILD", "MODERATE"), 
              levels = c("MILD", "MODERATE", "SEVERE"),               ordered = TRUE)

The resulting symptoms factor now includes information about the requested order. Unlike our prior factors, the levels of this factor are separated by < symbols to indicate the presence of a sequential order from MILD to SEVERE:

> symptoms[1] SEVERE   MILD     MODERATELevels: MILD < MODERATE < SEVERE

A helpful feature of ordered factors is that logical tests work as you would expect. For instance, we can test whether each patient's symptoms are more severe than moderate:

> symptoms > "MODERATE"[1]  TRUE FALSE FALSE

Machine learning algorithms capable of modeling ordinal data will expect ordered factors, so be sure to code your data accordingly.

Lists

A list is a data structure, much like a vector, in that it is used for storing an ordered set of elements. However, where a vector requires all its elements to be the same type, a list allows different R data types to be collected. Due to this flexibility, lists are often used to store various types of input and output data and sets of configuration parameters for machine learning models.

To illustrate lists, consider the medical patient dataset we have been constructing, with data for three patients stored in six vectors. If we wanted to display all the data for the first patient, we would need to enter five R commands:

> subject_name[1][1] "John Doe"> temperature[1][1] 98.1> 
flu_status[1][1] FALSE> gender[1][1] MALELevels: FEMALE MALE> 
blood[1][1] OLevels: A B AB O> symptoms[1][1] SEVERELevels: MILD < MODERATE < SEVERE

If we expect to examine the patient's data again in the future, rather than retyping these commands, a list allows us to group all of the values into one object we can use repeatedly.

Similar to creating a vector with c(), a list is created using the list() function as shown in the following example. One notable difference is that when a list is constructed, each component in the sequence should be given a name. The names are not strictly required, but allow the values to be accessed later on by name rather than by numbered position. To create a list with named components for all of the first patient's data, type the following:

> subject1 <- list(fullname = subject_name[1],   
                   temperature = temperature[1],   
                   flu_status = flu_status[1],   
                   gender = gender[1],        
                   blood = blood[1],          
                   symptoms = symptoms[1])

This patient's data is now collected in the subject1 list:

> subject1$fullname[1] "John Doe"$temperature[1] 98.1$flu_status[1] 
  FALSE$gender[1] MALELevels: FEMALE MALE$blood[1] OLevels: 
  A B AB O$symptoms[1] SEVERELevels: MILD < MODERATE < SEVERE

Note that the values are labeled with the names we specified in the preceding command. As a list retains order like a vector, its components can be accessed using numeric positions, as shown here for the temperature value:

> subject1[2]$temperature[1] 98.1

The result of using vector-style operators on a list object is another list object, which is a subset of the original list. For example, the preceding code returned a list with a single temperature component. To instead return a single list item in its native data type, use double brackets ([[ and ]]) when selecting the list component. For example, the following command returns a numeric vector of length one:

> subject1[[2]][1] 98.1

For clarity, it is often better to access list components by name, by appending a $ and the component name to the list name as follows:

> subject1$temperature[1] 98.1

Like the double-bracket notation, this returns the list component in its native data type (in this case, a numeric vector of length one).

It is possible to obtain several list items by specifying a vector of names. The following returns a subset of the subject1 list, which contains only the temperature and flu_status components:

> subject1[c("temperature",
"flu_status")]$temperature[1] 98.1$flu_status[1] FALSE

Entire datasets could be constructed using lists, and lists of lists. For example, you might consider creating a subject2 and subject3 list, and grouping these into a list object named pt_data. However, constructing a dataset in this way is common enough that R provides a specialized data structure specifically for this task.

Data Frames

By far, the most important R data structure utilized in machine learning is the data frame, a structure analogous to a spreadsheet or database in that it has both rows and columns of data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the same number of values. Now, because the data frame is literally a list of vector-type objects, it combines aspects of both vectors and lists.

Let's create a data frame for our patient dataset. Using the patient data vectors we created previously, the data.frame() function combines them into a data frame:

> pt_data <- data.frame(subject_name, temperature,  
                        flu_status, gender, blood, symptoms,      
                        stringsAsFactors = FALSE)

You might notice something new in the preceding code. We included an additional parameter: stringsAsFactors = FALSE. If we do not specify this option, R will automatically convert every character vector to a factor.

This feature is occasionally useful, but also sometimes unwarranted. Here, for example, the subject_name field is definitely not categorical data, as names are not categories of values. Therefore, setting the stringsAsFactors option to FALSE allows us to convert character vectors to factors only where it makes sense for the project.

When we display the pt_data data frame, we see that the structure is quite different from the data structures we worked with previously:

> pt_data  subject_name temperature flu_status gender blood symptoms1     John Doe 
       98.1      FALSE   MALE     O   SEVERE2     Jane Doe        98.6 
       FALSE FEMALE    AB     MILD3 Steve Graves       101.4       TRUE   MALE  
       A MODERATE

Compared to the one-dimensional vectors, factors, and lists, a data frame has two dimensions and is displayed in matrix format. This particular data frame has one column for each vector of patient data and one row for each patient. In machine learning terms, the data frame's columns are the features or attributes and the rows are the examples.

To extract entire columns (vectors) of data, we can take advantage of the fact that a data frame is simply a list of vectors. Similar to lists, the most direct way to extract a single element is by referring to it by name. For example, to obtain the subject_name vector, type:

> pt_data$subject_name[1] "John Doe"     "Jane Doe"     "Steve Graves"

Also similar to lists, a vector of names can be used to extract multiple columns from a data frame:

> pt_data[c("temperature", "flu_status")]  temperature flu_status1 
       98.1      FALSE2        98.6      FALSE3       101.4       TRUE

When we request columns in the data frame by name, the result is a data frame containing all rows of data for the specified columns. The command pt_data[2:3] will also extract the temperature and flu_status columns. However, referring to the columns by name results in clear and easy-to-maintain R code that will not break if the data frame is later reordered.

To extract specific values from the data frame, methods like those for accessing values in vectors are used. However, there is an important distinction—because the data frame is two-dimensional, both the desired rows and columns must be specified. Rows are specified first, followed by a comma, followed by the columns in a format like this: [rows, columns]. As with vectors, rows and columns are counted beginning at one.

For instance, to extract the value in the first row and second column of the patient data frame, use the following command:

> pt_data[1, 2][1] 98.1

If you would like more than a single row or column of data, specify vectors indicating the desired rows and columns. The following statement will pull data from the first and third rows and the second and fourth columns:

> pt_data[c(1, 3), c(2, 4)]  temperature gender1        98.1   MALE3       101.4   MALE

To refer to every row or every column, simply leave the row or column portion blank. For example, to extract all rows of the first column:

> pt_data[, 1][1] "John Doe"     "Jane Doe"     "Steve Graves"

To extract all columns for the first row:

> pt_data[1, ]  subject_name temperature flu_status gender blood symptoms1 
    John Doe        98.1      FALSE   MALE     O   SEVERE

And to extract everything:

> pt_data[ , ]  subject_name temperature flu_status gender blood symptoms1 
    John Doe        98.1      FALSE   MALE     O   SEVERE2     Jane Doe   
     98.6      FALSE FEMALE    AB     MILD3 Steve Graves       101.4  
     TRUE   MALE     A MODERATE

Of course, columns are better accessed by name rather than position, and negative signs can be used to exclude rows or columns of data. Therefore, the output of the command:

> pt_data[c(1, 3), c("temperature", "gender")]  temperature gender1  
      98.1   MALE3       101.4   MALE

is equivalent to:

> pt_data[-2, c(-1, -3, -5, -6)]  temperature gender1  
      98.1   MALE3       101.4   MALE

Sometimes it is necessary to create new columns in data frames—perhaps, for instance, as a function of existing columns. For example, we may need to convert the Fahrenheit temperature readings in the patient data frame to the Celsius scale. To do this, we simply use the assignment operator to assign the result of the conversion calculation to a new column name as follows:

> pt_data$temp_c <- (pt_data$temperature - 32) * (5 / 9)

To confirm the calculation worked, let's compare the new Celsius-based temp_c column to the previous Fahrenheit-scale temperature column:

> pt_data[c("temperature", "temp_c")]  temperature   temp_c1 
       98.1 36.722222        98.6 37.000003       101.4 38.55556

Seeing these side by side, we can confirm that the calculation has worked correctly.

To become more familiar with data frames, try practicing similar operations with the patient dataset, or even better, use data from one of your own projects. These types of operations are crucial for much of the work we will do in upcoming chapters.

Matrices and Arrays

In addition to data frames, R provides other structures that store values in tabular form. A matrix is a data structure that represents a two-dimensional table with rows and columns of data. Like vectors, R matrices can contain only one type of data, although they are most often used for mathematical operations and therefore typically store only numbers.

To create a matrix, simply supply a vector of data to the matrix() function, along with a parameter specifying the number of rows (nrow) or number of columns (ncol). For example, to create a 2x2 matrix storing the numbers one to four, we can use the nrow parameter to request the data to be divided into two rows:

> m <- matrix(c(1, 2, 3, 4), nrow = 2)> m     [,1] [,2][1,]    1    3[2,]    2    4

This is equivalent to the matrix produced using ncol = 2:

> m <- matrix(c(1, 2, 3, 4), ncol = 2)> m     [,1] [,2][1,]    1    3[2,]    2    4

You will notice that R loaded the first column of the matrix first before loading the second column. This is called column-major order, which is R's default method for loading matrices.

To illustrate this further, let's see what happens if we add more values to the matrix.

With six values, requesting two rows creates a matrix with three columns:

> m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)> m  
   [,1] [,2] [,3][1,]    1    3    5[2,]    2    4    6

Requesting two columns creates a matrix with three rows:

> m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)> m 
    [,1] [,2][1,]    1    4[2,]    2    5[3,]    3    6

As with data frames, values in matrices can be extracted using [row, column] notation. For instance, m[1, 1] will return the value 1 while m[3, 2] will extract 6 from the m matrix. Additionally, entire rows or columns can be requested:

> m[1, ][1] 1 4> m[, 1][1] 1 2 3

Closely related to the matrix structure is the array, which is a multidimensional table of data. Where a matrix has rows and columns of values, an array has rows, columns, and a number of additional layers of values.