Chapter 10 Dataframe
Dataframes are one of the most commonly used objects in R for representing data. Similar to a matrix, a dataframe is a two-dimensional structure where data is organized in rows and columns. It can be helpful to think of a dataframe just like a regular table you might create in a spreadsheet (e.g., Excel) where you can enter your data. The term “data” here is intentionally broad because, unlike matrices, dataframes can contain different types of data within the same object (e.g., names, dates, and numbers).
The basic structure of a dataframe is thus the same as a matrix, but it allows us to include various data types in the same object, such as characters and numerical values. This enables us to store all the features of statistical units (e.g., numerical, categorical, nominal variables, etc.) that we aim to analyze later. A crucial point is that dataframes are designed to handle complex and heterogeneous data, typical in real data analysis. If you use other programming languages geared toward data analysis (e.g., Matlab), you’ll notice the lack of a powerful and intuitive object like the dataframe.
Generally, in a dataframe, rows represent statistical units (e.g., people or observations), and columns represent variables, which are properties measured on those units. However, there are two main dataframe formats depending on how the data is organized: wide dataframes and long dataframes. Let’s evaluate the difference between the two formats by considering data where, for each subject, we measure age, gender, and responses to three items in a questionnaire.
Wide Dataframe
In the wide format, each row of the dataframe represents a subject, and each of their responses or measured variables is reported in a different column. Even repeated measurements of the same type of variable, such as questionnaire items, experimental trials, or the same variable measured at multiple time points, are presented in different columns, so that everything collected for a single participant is reported in a single row. The wide format is arguably the most familiar to researchers who routinely perform analyses with SPSS. In Table 10.1, some example data is presented in wide format: notice how responses to different items are reported in multiple columns.
Id | age | gender | item_1 | item_2 | item_3 |
---|---|---|---|---|---|
subj_1 | 21 | F | 2 | 0 | 2 |
subj_2 | 23 | M | 1 | 2 | 0 |
subj_3 | 19 | F | 1 | 1 | 1 |
We see that each subject is identified by a code in the Id variable, and their responses to the three items are reported in three different variables item_1, item_2, and item_3.
Long Dataframe
In the long format, each row represents a single observation. Therefore, the data for each subject is spread across multiple rows, and variables that don’t change across observations are repeated. In Table 10.3, the example data is presented in long format.
Id | age | gender | item | response |
---|---|---|---|---|
subj_1 | 21 | F | 1 | 2 |
subj_1 | 21 | F | 2 | 0 |
subj_1 | 21 | F | 3 | 2 |
subj_2 | 23 | M | 1 | 1 |
subj_2 | 23 | M | 2 | 2 |
subj_2 | 23 | M | 3 | 0 |
subj_3 | 19 | F | 1 | 1 |
subj_3 | 19 | F | 2 | 1 |
subj_3 | 19 | F | 3 | 1 |
We observe that each subject’s responses are spread across multiple rows. The characteristics that do not vary are repeated in each row (Id, age, and gender), while the responses to the items are recorded using two columns: item, which represents the item number, and response, which records the subject’s actual response to that specific item.
Long and wide data formats have different properties, especially in terms of usage. The data and the final result are the same; however, some software or R packages work only with datasets organized in a certain way. For example, the lme4 package for fitting mixed models and ggplot2 for data visualization mostly require long-format dataframes, while the lavaan package for structural equation modeling mostly requires wide-format dataframes.
There isn’t a “correct” or “incorrect” format; it depends on the type of analysis and the software or package you’re using. Some operations or analyses require the dataset in long format, while others require it in wide format.
However, it’s advisable to become accustomed to thinking in long format because most modern data analysis and visualization packages require the data in this structure. There are also functions for quickly switching from one format to another (e.g., the reshape() function).
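As an illustration of switching formats, here is a minimal sketch that rebuilds the wide data of Table 10.1 and converts it to long format with the base R reshape() function. The object names data_wide and data_long are chosen here for illustration, and the argument values (e.g., naming the stacked column response) are assumptions made to mirror the tables above.

```r
# Wide-format data from Table 10.1 (one row per subject)
data_wide <- data.frame(
  Id     = c("subj_1", "subj_2", "subj_3"),
  age    = c(21, 23, 19),
  gender = c("F", "M", "F"),
  item_1 = c(2, 1, 1),
  item_2 = c(0, 2, 1),
  item_3 = c(2, 0, 1)
)

# Wide -> long: stack the three item columns into one "response" column
data_long <- reshape(
  data_wide,
  direction = "long",
  varying   = c("item_1", "item_2", "item_3"),  # columns to stack
  v.names   = "response",                       # name of the stacked column
  timevar   = "item",                           # column holding the item number
  times     = 1:3,                              # values stored in "item"
  idvar     = "Id"                              # subject identifier
)

# Group the rows by subject and reset the row names
data_long <- data_long[order(data_long$Id, data_long$item), ]
rownames(data_long) <- NULL
data_long
```

The result has one row per subject and item (nine rows in total), with Id, age, and gender repeated on each row of the same subject.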
Note that in the previous examples, we used both numeric and character columns. This wasn’t possible with matrices. However, keep in mind that, just like matrices, dataframes require that all columns (variables) have the same number of elements.
Now we’ll see how to create dataframes in R, how to perform common selection operations, and describe some simple manipulations with dataframes. As we’ll see, there are many similarities between using dataframes and matrices. When necessary, we’ll refer to the previous chapter to highlight common aspects and differences between these two data structures.
10.1 Creating a dataframe
In most cases, you will probably import existing data (e.g., from a .csv file containing some collected experimental data) rather than creating a dataframe from scratch in R. However, let’s now see how to create a new dataframe. The command to create a dataframe is data.frame().
Creating a dataframe is slightly different from creating a matrix. Intuitively, it’s like combining several columns (where one may contain names, another dates, and so on) rather than inserting values by row or column as in matrices. In fact, to create a dataframe, you need to specify the columns one by one, indicating column_name = values within the data.frame() command. Let’s see an example in R:
my_data = data.frame(
  Id = c(1:5),
  names = c("Alice", "Bruno", "Carla", "Diego", "Elisa"),
  gender = factor(c("F", "M", "F", "M", "F")),
  age = c(22, 25, 23, 22, 24),
  faculty = factor(c("Psychology", "Engineering", "Medicine", "Arts", "Psychology"))
)
my_data
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Diego M 22 Arts
## 5 5 Elisa F 24 Psychology
In this case, we’ve created a hypothetical dataframe where each row represents a subject, and each column represents a certain characteristic of that subject, like gender, age, and so on.
It’s always recommended to define a column (e.g., Id) where you assign a unique identifier to each subject. This allows you to identify them without risking errors. For example, using a name (or even first and last names) may not be ideal because multiple people could have the same name, making it difficult to distinguish between them. While this may not seem strictly needed when using wide dataframes, as different rows always represent different individuals, it becomes crucial when using long dataframes, as multiple rows may or may not represent data from the same individual.
A character variable within a dataframe is treated by default as a simple sequence of characters. You can specify the argument stringsAsFactors = TRUE to automatically convert all character variables to categorical variables, i.e., factors (see Chapter 8). Note that the default behavior depends on the version of R. Versions prior to R 4.0 had stringsAsFactors = TRUE by default, while from R 4.0 onwards, the default is stringsAsFactors = FALSE. So, pay attention when using code written before version 4.0.
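The effect of this argument can be checked directly. The sketch below sets stringsAsFactors explicitly in both calls, so it should behave the same regardless of the R version; the object names df_chr and df_fct are hypothetical.

```r
# Same character column, created without and with automatic conversion
df_chr <- data.frame(
  faculty = c("Arts", "Psychology", "Arts"),
  stringsAsFactors = FALSE
)
df_fct <- data.frame(
  faculty = c("Arts", "Psychology", "Arts"),
  stringsAsFactors = TRUE
)

class(df_chr$faculty)   # "character"
class(df_fct$faculty)   # "factor"
levels(df_fct$faculty)  # "Arts" "Psychology"
```

Setting the argument explicitly is one way to make older scripts behave consistently across R versions.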
10.2 Selecting Elements
To select one or more values from a dataframe, you can use row and column indices within square brackets, similar to matrices: dataframe_name[<row-index>, <column-index>].
Remember that the order [<row-index>, <column-index>] is fixed and must be followed for the selection to work correctly. You can perform various types of selections depending on your needs, using the same procedures seen for matrices. For example, using an extended version of the my_data dataframe created earlier (now with 10 subjects), we can select:
my_data
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Diego M 22 Arts
## 5 5 Elisa F 24 Psychology
## 6 6 Fabrizio M 35 Arts
## 7 7 Gloria F 26 Engineering
## 8 8 Herman M 20 Psychology
## 9 9 Irene F 23 Statistics
## 10 10 Luca M 22 Engineering
# Value in 3rd row and 4th column
my_data[3,4]
## [1] 23
# All variables of the 2nd and 3rd rows
my_data[c(2,3), ]
## Id names gender age faculty
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
# All values of the 5th variable
my_data[ , 5]
## [1] Psychology Engineering Medicine Arts Psychology Arts
## [7] Engineering Psychology Statistics Engineering
## Levels: Arts Engineering Medicine Psychology Statistics
# Values of the 2nd and 4th variables for the 3rd and 5th rows
my_data[c(3,5), c(2,4)]
## names age
## 3 Carla 23
## 5 Elisa 24
However, when working with dataframes, it’s more common to refer to the column names and use logical conditions to select values of interest.
Selecting Columns by Variable Names
One significant difference between matrices and dataframes is that you can access columns using the $ operator and the column name, with the syntax dataframe_name$variable_name.
This allows you to access that specific column directly without using indices and square brackets. For example:
# Select the "names" variable
my_data$names
## [1] "Alice" "Bruno" "Carla" "Diego" "Elisa" "Fabrizio"
## [7] "Gloria" "Herman" "Irene" "Luca"
# Select the "faculty" variable
my_data$faculty
## [1] Psychology Engineering Medicine Arts Psychology Arts
## [7] Engineering Psychology Statistics Engineering
## Levels: Arts Engineering Medicine Psychology Statistics
Alternatively, you can use the usual square bracket selection method by specifying the column names instead of column indices. This allows you to select multiple variables at once. For example:
# Select only the "names" variable
my_data[ , "names"]
## [1] "Alice" "Bruno" "Carla" "Diego" "Elisa" "Fabrizio"
## [7] "Gloria" "Herman" "Irene" "Luca"
# Select the "names", "gender", and "faculty" variables
my_data[ , c("names", "gender", "faculty")]
## names gender faculty
## 1 Alice F Psychology
## 2 Bruno M Engineering
## 3 Carla F Medicine
## 4 Diego M Arts
## 5 Elisa F Psychology
## 6 Fabrizio M Arts
## 7 Gloria F Engineering
## 8 Herman M Psychology
## 9 Irene F Statistics
## 10 Luca M Engineering
Note that the variable names must be provided as strings.
Selecting Rows with Logical Conditions
We’ve seen earlier how to construct logical propositions to select only values that meet a certain condition in the case of vectors and matrices. With dataframes, this feature becomes particularly useful, as it allows us to query the data in a simple and intuitive way. By using a logical condition, we can filter the observations that satisfy it and retrieve only the information of interest.
In the canonical syntax [<row-index>, <column-index>], the row indices are replaced with a logical condition to filter the rows, and the column indices are replaced with the names of the variables of interest, giving the syntax dataframe_name[<logical-condition>, <variable-names>].
Here are some examples of selection:
# All data for "Diego" (Id == 4)
my_data[my_data$Id == 4 , ]
## Id names gender age faculty
## 4 4 Diego M 22 Arts
# All data for females
my_data[my_data$gender == "F" , ]
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 3 3 Carla F 23 Medicine
## 5 5 Elisa F 24 Psychology
## 7 7 Gloria F 26 Engineering
## 9 9 Irene F 23 Statistics
# Age and faculty of subjects older than 24 years
my_data[my_data$age > 24 , c("age", "faculty")]
## age faculty
## 2 25 Engineering
## 6 35 Arts
## 7 26 Engineering
Note that when defining a logical condition using the dataframe’s variables, it’s still necessary to include the dataframe name. In the previous example, we would have encountered an error by simply using age > 24, as this would refer to the object age (which doesn’t exist), not the age variable contained in my_data.
Similarly to what we saw with vectors, using the condition my_data$age > 24 returns a vector of TRUE and FALSE values depending on whether the condition is met or not.
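To make the logical vector visible, the following sketch stores the condition in its own object before using it as a row index. The small dataframe ages and the object older are hypothetical names introduced only for this illustration.

```r
# A small illustrative dataframe
ages <- data.frame(Id = 1:5, age = c(22, 25, 23, 22, 24))

# The condition alone returns one TRUE/FALSE value per row
older <- ages$age > 23
older        # FALSE TRUE FALSE FALSE TRUE

# TRUE counts as 1, so sum() gives the number of rows meeting the condition
sum(older)   # 2

# Used as a row index, the logical vector keeps only the TRUE rows
ages[older, ]
```

Storing the condition separately can also make long selections easier to read and to check.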
By using logical operators AND (&), OR (|), and NOT (!), we can combine multiple logical operations to create more complex, yet still intuitive, indexing queries. For example, to select “subjects between 20 and 25 years old enrolled in psychology,” we would write:
my_data[my_data$age>=20 & my_data$age<=25 & my_data$faculty=="Psychology" , ]
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 5 5 Elisa F 24 Psychology
## 8 8 Herman M 20 Psychology
By using this indexing method, we can appreciate the true power of dataframes. We now have a very simple and intuitive method for working with complex data structures that contain different data types.
Two important aspects regarding the result of a selection are the type of output obtained and how to save it.
Output Type
Similarly to matrices, selection commands do not always return the same type of object. When we select a single variable, the result is a vector, while selecting two or more variables returns a dataframe.
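The difference in output type can be verified with a short sketch. The dataframe df_small is a hypothetical example; the drop = FALSE argument of the square-bracket selection is shown as one way to keep a single column as a dataframe.

```r
df_small <- data.frame(
  names = c("Alice", "Bruno", "Carla"),
  age   = c(22, 25, 23)
)

one_col  <- df_small[, "age"]                # single column: a numeric vector
two_cols <- df_small[, c("names", "age")]    # two columns: a dataframe
kept_df  <- df_small[, "age", drop = FALSE]  # single column kept as a dataframe

is.vector(one_col)       # TRUE
is.data.frame(two_cols)  # TRUE
is.data.frame(kept_df)   # TRUE
```

Keeping the dataframe structure can be convenient when later code expects a dataframe regardless of how many columns were selected.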
Saving the Selection
As with other types of objects, selection operations do not modify the original object. Therefore, if you wish to keep the changes, you need to save the result of the selection. It’s advisable to create a new object and not overwrite the original one, as this will allow you to make new selections or operations on the original data. It’s good practice to always keep one copy of the dataframe containing the original version of the data.
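As a minimal sketch of this practice, the selection below is stored in a new object while the original dataframe stays unchanged. The names original and psy_only are hypothetical.

```r
original <- data.frame(
  Id      = 1:4,
  faculty = c("Psychology", "Arts", "Psychology", "Medicine")
)

# Store the selection in a new object instead of overwriting `original`
psy_only <- original[original$faculty == "Psychology", ]

nrow(original)  # 4: the original data is untouched
nrow(psy_only)  # 2
```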
10.2.1 Advanced Selection Uses
Let’s now look at some advanced uses of selecting elements from a dataframe.
Modifying Elements
Similar to other objects, we can modify values by selecting the old value in the dataframe and using the = (or <-) operator to assign the new value.
my_data[1:5 , ]
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Diego M 22 Arts
## 5 5 Elisa F 24 Psychology
# Replace the name in the 4th row with "Davide"
my_data[4 , "names"] = "Davide"
my_data[1:5 , ]
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Davide M 22 Arts
## 5 5 Elisa F 24 Psychology
Remove Rows or Columns
Similarly to matrices, to remove rows (or columns) from a dataframe, you need to indicate the row (or column) indices to remove within square brackets, preceded by the - (minus) operator.
# Remove the first 5 observations/rows
my_data[-c(1:5) , ]
## Id names gender age faculty
## 6 6 Fabrizio M 35 Arts
## 7 7 Gloria F 26 Engineering
## 8 8 Herman M 20 Psychology
## 9 9 Irene F 23 Statistics
## 10 10 Luca M 22 Engineering
# Remove the third column
my_data[ , -3]
## Id names age faculty
## 1 1 Alice 22 Psychology
## 2 2 Bruno 25 Engineering
## 3 3 Carla 23 Medicine
## 4 4 Davide 22 Arts
## 5 5 Elisa 24 Psychology
## 6 6 Fabrizio 35 Arts
## 7 7 Gloria 26 Engineering
## 8 8 Herman 20 Psychology
## 9 9 Irene 23 Statistics
## 10 10 Luca 22 Engineering
You can also exclude (and in a sense remove) information using logical operators such as NOT (i.e., !) and not equal to (!=). In the following example, only rows where the faculty information is not equal to (i.e., different from) Psychology are kept (so, observations associated with Psychology are effectively removed from the dataframe):
# Select everything except psychology students
my_data[my_data$faculty != "Psychology" , ]
## Id names gender age faculty
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Davide M 22 Arts
## 6 6 Fabrizio M 35 Arts
## 7 7 Gloria F 26 Engineering
## 9 9 Irene F 23 Statistics
## 10 10 Luca M 22 Engineering
Removing multiple categories, such as both Psychology and Statistics, while keeping the rest may be a bit more difficult. Here are two possible methods; the second one is more concise to write:
# method 1 : keep rows where "faculty" is not "Psychology" AND not "Statistics"
my_data[my_data$faculty != "Psychology" & my_data$faculty != "Statistics" , ]
## Id names gender age faculty
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Davide M 22 Arts
## 6 6 Fabrizio M 35 Arts
## 7 7 Gloria F 26 Engineering
## 10 10 Luca M 22 Engineering
# method 2 : keep all rows but NOT those in the vector c("Psychology", "Statistics")
my_data[!my_data$faculty %in% c("Psychology","Statistics") , ]
## Id names gender age faculty
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Davide M 22 Arts
## 6 6 Fabrizio M 35 Arts
## 7 7 Gloria F 26 Engineering
## 10 10 Luca M 22 Engineering
A strategy similar to the second method shown just above can be used to remove multiple columns by name. In fact, the - (minus) operator does not work when listing columns by name. Let’s say we want to view the my_data dataframe without the columns related to gender and age. We may proceed as follows:
# Select all column names that are NOT in the vector c("gender","age")
my_data[ , !colnames(my_data) %in% c("gender","age")]
## Id names faculty
## 1 1 Alice Psychology
## 2 2 Bruno Engineering
## 3 3 Carla Medicine
## 4 4 Davide Arts
## 5 5 Elisa Psychology
## 6 6 Fabrizio Arts
## 7 7 Gloria Engineering
## 8 8 Herman Psychology
## 9 9 Irene Statistics
## 10 10 Luca Engineering
Note that the “removal” operation is actually a “selection” operation. Therefore, if you want to keep the changes, you need to save the result.
Using the - operator is somewhat risky, especially if the object created (or overwritten) is then used in other operations. Removing information, unless strictly necessary, is not a good practice. If you need to select part of the data, it’s always better to create a new dataframe (or a new object in general) while keeping a version of the original data available.
Exercises
Referring to the data_long and data_wide dataframes created earlier, complete the following exercises:
- Using numeric indices for rows and columns, select the data for the subject subj_2 regarding the item and response variables from the data_long dataframe.
- Perform the same selection as the previous exercise, but this time use a logical condition for the row indices and specify the column names directly.
- In the data_wide dataframe, select the Id and gender variables for subjects who responded with 1 to the item_1 variable.
- In the data_long dataframe, select only the data for girls older than 20 years.
- Remove the observations regarding subj_2 and the "gender" variable from the data_long dataframe.
10.3 Functions and Operations
Now, let’s look at some commonly used functions and operations performed with dataframes (see Table 10.5).
Function | Description |
---|---|
nrow(dataframe_name) | Number of observations in the dataframe |
ncol(dataframe_name) | Number of variables in the dataframe |
colnames(dataframe_name) | Column names of the dataframe |
rownames(dataframe_name) | Row names of the dataframe |
dataframe_name <- cbind(dataframe_name, data) <br> dataframe_name$variable_name <- data | Add a new variable to the dataframe (must have the same number of rows) |
dataframe_name <- rbind(dataframe_name, data) | Add observations (new data must match the structure of the dataframe) |
head(dataframe_name) | First rows of the dataframe |
tail(dataframe_name) | Last rows of the dataframe |
str(dataframe_name) | Structure of the dataframe |
summary(dataframe_name) | Summary of the dataframe |
We will now describe in detail some particular uses, considering as an example a reduced version of the my_data dataframe created earlier.
data_short = my_data[1:5, ]
data_short
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Davide M 22 Arts
## 5 5 Elisa F 24 Psychology
10.3.1 Attributes of a Dataframe
We saw in Chapter 8.1 that objects in R possess what are called attributes, useful information about the object itself, akin to metadata. We will now, similarly to matrices, evaluate the dimensions of a dataframe and the names of its rows and columns.
Dimensions
Recall that a dataframe is a two-dimensional object consisting of rows and columns. To obtain the number of rows and columns in a dataframe, we can use the nrow() and ncol() functions, respectively.
Alternatively, as with matrices, we can use the dim() function, which returns a vector with two values: the first representing the number of rows and the second the number of columns.
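A short sketch of these three functions follows, using a hypothetical dataframe df_dims with five rows and three columns.

```r
df_dims <- data.frame(
  Id     = 1:5,
  age    = c(22, 25, 23, 22, 24),
  gender = c("F", "M", "F", "M", "F")
)

nrow(df_dims)  # 5
ncol(df_dims)  # 3
dim(df_dims)   # 5 3 (rows first, then columns)
```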
Row and Column Names
Similarly to matrices, you can access the row and column names using the rownames() and colnames() functions, respectively. By default, dataframes require names only for the columns, while rows are assigned names based on their row indices. However, it’s possible to name rows with arbitrary values, although this is rarely used.
# Check the current names
rownames(data_short)
## [1] "1" "2" "3" "4" "5"
colnames(data_short)
## [1] "Id" "names" "gender" "age" "faculty"
To set row and/or column names, assign a character vector of the same length as the dimension you’re renaming to rownames(dataframe_name) or colnames(dataframe_name).
# Assign names to the rows
rownames(data_short) = paste0("Subj_", 1:nrow(data_short))
# Assign names to the columns
colnames(data_short) = c("Id", "Name", "Gender", "Age", "Faculty")
data_short
## Id Name Gender Age Faculty
## Subj_1 1 Alice F 22 Psychology
## Subj_2 2 Bruno M 25 Engineering
## Subj_3 3 Carla F 23 Medicine
## Subj_4 4 Davide M 22 Arts
## Subj_5 5 Elisa F 24 Psychology
Finally, note that the names() function for dataframes is analogous to colnames(), and you can use NULL to remove, for example, row names.
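Both points can be checked with a small sketch. The dataframe df_names is a hypothetical example; note that assigning NULL to the row names of a dataframe resets them to the default row indices rather than leaving the rows unnamed.

```r
df_names <- data.frame(Id = 1:3, Name = c("Alice", "Bruno", "Carla"))
rownames(df_names) <- paste0("Subj_", 1:3)

# For dataframes, names() and colnames() return the same values
identical(names(df_names), colnames(df_names))  # TRUE

# Assigning NULL removes the custom row names
rownames(df_names) <- NULL
rownames(df_names)  # "1" "2" "3"
```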
10.3.2 Merging Dataframes
Similarly to matrices, you can merge multiple dataframes using the cbind() and rbind() functions, with the same dimensionality requirements regarding rows and columns. However, for dataframes, you can also create a new column using the $ operator. Let’s now describe these uses in detail.
dataframe$name = new_var
With the syntax dataframe$name = new_var, we add a new column named name to the dataframe, taking values from new_var. This new variable must have the same number of values as there are rows in the dataframe.
# Add the "Average" column
data_short$Average = c(27.5, 23.6, 28.3, 29.2, 24.8)
# note that this is the same as:
average_grades = c(27.5, 23.6, 28.3, 29.2, 24.8)
data_short$Average = average_grades
data_short
## Id Name Gender Age Faculty Average
## 1 1 Alice F 22 Psychology 27.5
## 2 2 Bruno M 25 Engineering 23.6
## 3 3 Carla F 23 Medicine 28.3
## 4 4 Davide M 22 Arts 29.2
## 5 5 Elisa F 24 Psychology 24.8
cbind()
With the cbind() function, we can add one or more variables to our dataframe. Note that, unlike using the $ operator, the result is not automatically saved; it must be assigned to an object, e.g., dataframe = cbind(dataframe, new_var). In this case, the column name will be new_var. If we want to rename the column, we can use the syntax cbind(dataframe, "name" = new_var) or name the object directly:
# Add the "number_of_exams" variable
number_of_exams = c(12, 14, 13, 10, 8)
cbind(data_short, number_of_exams) # without specifying the name
## Id Name Gender Age Faculty Average number_of_exams
## 1 1 Alice F 22 Psychology 27.5 12
## 2 2 Bruno M 25 Engineering 23.6 14
## 3 3 Carla F 23 Medicine 28.3 13
## 4 4 Davide M 22 Arts 29.2 10
## 5 5 Elisa F 24 Psychology 24.8 8
cbind(data_short, "N_exams" = number_of_exams) # specifying the name
## Id Name Gender Age Faculty Average N_exams
## 1 1 Alice F 22 Psychology 27.5 12
## 2 2 Bruno M 25 Engineering 23.6 14
## 3 3 Carla F 23 Medicine 28.3 13
## 4 4 Davide M 22 Arts 29.2 10
## 5 5 Elisa F 24 Psychology 24.8 8
Again, the new variable must have the same number of values as there are rows in the dataframe.
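The row-count requirement can be seen in a short sketch. The objects df3, ok, and too_short are hypothetical names, and tryCatch() is used here only so that the failing call returns a message instead of stopping the script.

```r
df3 <- data.frame(Id = 1:3)

# Works: three values for three rows
ok <- cbind(df3, new_var = c(10, 20, 30))
ok

# Fails: two values cannot fill three rows, so R raises an error
too_short <- tryCatch(
  cbind(df3, new_var = c(10, 20)),
  error = function(e) "error: differing number of rows"
)
too_short
```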
rbind()
Adding rows to a dataframe is slightly more complex. You may need to do this for different reasons. For example, you may have collected experimental data from different participants in different .csv files, and now you have to import and combine them in a single dataframe including all data from all participants. Matrices by default do not have column names and contain only numbers or only strings, so binding them by row requires only the same number of columns. Two dataframes, instead, must have the same structure to be merged by row, meaning:
- The same number of columns (as with matrices)
- The same column names
In the following example, we simply add one single row to an existing dataframe:
data_short
## Id Name Gender Age Faculty Average
## 1 1 Alice F 22 Psychology 27.5
## 2 2 Bruno M 25 Engineering 23.6
## 3 3 Carla F 23 Medicine 28.3
## 4 4 Davide M 22 Arts 29.2
## 5 5 Elisa F 24 Psychology 24.8
# New data with the same columns, but named differently
new_data = data.frame(
Id = 6,
Name = "Marta",
Sex = "F", # note "Sex" instead of "Gender"
Age = 44,
Faculty = "Philosophy",
Average = 28.7
)
new_data
## Id Name Sex Age Faculty Average
## 1 6 Marta F 44 Philosophy 28.7
rbind(data_short, new_data) # Error!
## Error in match.names(clabs, names(xi)): names do not match previous names
# New data with the same columns having the exact same names
new_data = data.frame(
Id = 6,
Name = "Marta",
Gender = "F",
Age = 44,
Faculty = "Philosophy",
Average = 28.7
)
rbind(data_short, new_data)
## Id Name Gender Age Faculty Average
## 1 1 Alice F 22 Psychology 27.5
## 2 2 Bruno M 25 Engineering 23.6
## 3 3 Carla F 23 Medicine 28.3
## 4 4 Davide M 22 Arts 29.2
## 5 5 Elisa F 24 Psychology 24.8
## 6 6 Marta F 44 Philosophy 28.7
Again, you need to save the result to keep the changes (which we have not done in the above example).
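For the multi-participant scenario mentioned at the start of this section, one possible approach is to collect the dataframes in a list and bind them all at once with do.call(). This is a sketch under stated assumptions: the per-participant dataframes part_1, part_2, and part_3 and their columns are hypothetical, and in practice each would typically come from importing a file.

```r
# Hypothetical per-participant dataframes with identical column names
part_1 <- data.frame(Id = "subj_1", trial = 1:2, rt = c(0.52, 0.48))
part_2 <- data.frame(Id = "subj_2", trial = 1:2, rt = c(0.61, 0.57))
part_3 <- data.frame(Id = "subj_3", trial = 1:2, rt = c(0.45, 0.50))

# Collect them in a list and bind all rows in a single call
all_parts <- list(part_1, part_2, part_3)
all_data  <- do.call(rbind, all_parts)

# The result is saved in all_data, so it is kept for later use
all_data
```

Because rbind() checks the column names, a file with a misnamed column would trigger the same error shown above, which helps catch inconsistencies early.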
10.3.3 Dataframe Information
Finally, let’s look at some very common functions used to get summary information about the data contained in a dataframe:
head() (or tail()) allows us to view the first (or last) rows of our dataframe:
head(my_data)
## Id names gender age faculty
## 1 1 Alice F 22 Psychology
## 2 2 Bruno M 25 Engineering
## 3 3 Carla F 23 Medicine
## 4 4 Davide M 22 Arts
## 5 5 Elisa F 24 Psychology
## 6 6 Fabrizio M 35 Arts
tail(my_data)
## Id names gender age faculty
## 5 5 Elisa F 24 Psychology
## 6 6 Fabrizio M 35 Arts
## 7 7 Gloria F 26 Engineering
## 8 8 Herman M 20 Psychology
## 9 9 Irene F 23 Statistics
## 10 10 Luca M 22 Engineering
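By default, both functions show six rows. The number of rows can be changed with the n argument, as in this small sketch based on a hypothetical dataframe df_ten.

```r
df_ten <- data.frame(Id = 1:10, score = seq(10, 100, by = 10))

first_three <- head(df_ten, n = 3)  # first three rows
last_two    <- tail(df_ten, n = 2)  # last two rows

first_three
last_two
```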
str() lets us evaluate the structure of the dataset, providing useful information such as the number of observations, the number of variables, and the type of variables:
str(my_data)
## 'data.frame': 10 obs. of 5 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10
## $ names : chr "Alice" "Bruno" "Carla" "Davide" ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 1 2 1 2 1 2
## $ age : num 22 25 23 22 24 35 26 20 23 22
## $ faculty: Factor w/ 5 levels "Arts","Engineering",..: 4 2 3 1 4 1 2 4 5 2
summary() gives us summary information about the variables depending on their type:
summary(my_data)
## Id names gender age faculty
## Min. : 1.00 Length:10 F:5 Min. :20.00 Arts :2
## 1st Qu.: 3.25 Class :character M:5 1st Qu.:22.00 Engineering:3
## Median : 5.50 Mode :character Median :23.00 Medicine :1
## Mean : 5.50 Mean :24.20 Psychology :3
## 3rd Qu.: 7.75 3rd Qu.:24.75 Statistics :1
## Max. :10.00 Max. :35.00
Exercises
Referring to the data_long and data_wide dataframes, complete the following exercises:
- Add the numerical variable "memory_pre" to both the data_wide and data_long dataframes:
## Id memory_pre
## 1 subj_1 3
## 2 subj_2 2
## 3 subj_3 1
- Add the categorical variable "group" to both the data_wide and data_long dataframes:
## Id group
## 1 subj_1 treatment
## 2 subj_2 treatment
## 3 subj_3 control
- Add the data for subjects subj_4 and subj_5 to the data_wide dataframe:
## Id age gender item_1 item_2 item_3 memory_pre group
## 1 subj_4 25 F 1 0 2 1 treatment
## 2 subj_5 22 M 1 1 0 3 control
- In the data_wide dataframe, calculate the "memory_post" variable as the sum of the items.
- In the data_wide dataframe, rename the variables item_1, item_2, and item_3 to problem_1, problem_2, and problem_3, respectively.