DataFrame
With the limitations of vectors, matrices, and lists, data science practitioners needed a data structure better suited to real-world datasets. DataFrames are an elegant way of storing and retrieving tabular data. We have already seen how a DataFrame handles rows and columns of data in Exercise 3, Reading a JSON File and Storing the Data in DataFrame. DataFrames will be used extensively throughout the book.
Exercise 7: Performing Integrity Checks Using DataFrame
Let's revisit step 6 of Exercise 6, Using the List Method for Storing Integers and Characters Together, where we discussed the integrity check performed when we attempted to store two vectors of unequal length in a list, and see how a DataFrame handles the same situation differently. Once again, we will generate random numbers (r_numbers) and random characters (r_characters).
Perform the following steps to complete the exercise:
- First, generate 16 random numbers drawn from a binomial distribution, with the size parameter set to 100 and the probability of success set to 0.4:
r_numbers <- rbinom(n = 16, size = 100, prob = 0.4)
- Select any 18 letters from the built-in LETTERS vector without repetition:
r_characters <- sample(LETTERS, 18, FALSE)
- Put r_numbers and r_characters into a single DataFrame:
data.frame(r_numbers, r_characters)
The output is as follows:
Error in data.frame(r_numbers, r_characters) :
arguments imply differing number of rows: 16, 18
As you can see, the error in the previous output shows that the two vectors differ in length: the last two elements of r_characters have no corresponding random number generated from the binomial distribution, so data.frame() refuses to combine them. This is exactly the integrity check that the list in Exercise 6 did not enforce.
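If the two vectors genuinely belong together, one possible workaround (a minimal sketch, not part of the original exercise) is to pad the shorter vector with NA before building the DataFrame; extending a vector's length in R fills the new positions with NA:
# Pad r_numbers to the same length as r_characters; the new slots become NA
length(r_numbers) <- length(r_characters)
# data.frame() now succeeds; the last two rows carry NA in r_numbers
df <- data.frame(r_numbers, r_characters)
tail(df, 3)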
Accessing a particular row or column of a DataFrame is similar to doing so in a matrix. We will show many tricks and techniques to make the best use of indexing in a DataFrame, including some of the filtering options.
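As a quick illustration, the following minimal sketch (the scores object and its column names are made up for this example) shows matrix-style indexing and a simple filter:
scores <- data.frame(id = 1:5, score = c(72, 85, 90, 66, 78))
scores[2, ]                  # second row, all columns
scores[, "score"]            # the score column as a vector
scores$score                 # the same column using $ notation
scores[scores$score > 75, ]  # filter: rows where score exceeds 75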
Every row in a DataFrame is a tightly coupled collection of column values, and the columns define the relationship each row of data has with the others. If no value is available for a column in a given row, it is recorded as NA. For example, a customer in a CRM application might not have filled in their marital status, whereas other customers did. It therefore becomes essential during application design to specify which columns are mandatory and which are optional.
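To make the marital status example concrete, here is a small hypothetical CRM-style DataFrame (the customer names and columns are illustrative only) where the missing value is stored as NA and can be filtered with is.na():
customers <- data.frame(name = c("Asha", "Ben", "Carla"),
                        marital_status = c("married", NA, "single"),
                        stringsAsFactors = FALSE)
customers[is.na(customers$marital_status), ]   # customers who skipped the field
customers[!is.na(customers$marital_status), ]  # customers who filled it in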
Data Table
With the growing adoption of DataFrames, their limitations started to surface, particularly with large datasets, where DataFrames perform poorly. In complex analyses, we often create many intermediate DataFrames to store results. However, R is built on an in-memory computation architecture and depends heavily on RAM. Unlike disk space, RAM is limited to 4 or 8 GB on many standard desktops and laptops. DataFrames are not built to manage memory efficiently during computation, which often results in out-of-memory errors, especially when working with large datasets.
To address this issue, data.table inherits the data.frame functionality and offers fast, memory-efficient versions of the following tasks on top of it (a brief syntax sketch follows the list):
- File reading and writing
- Aggregations
- Updates
- Equi, non-equi, rolling, range, and interval joins
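The following minimal sketch (the table and column names are invented for illustration) hints at the DT[i, j, by] syntax behind the aggregation and update features listed above:
library(data.table)
dt <- data.table(store = c("A", "A", "B", "B"),
                 sales = c(10, 20, 5, 15))
dt[, .(mean_sales = mean(sales)), by = store]  # aggregation: mean sales per store
dt[, sales_sq := sales^2]                      # update by reference, without copying the table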
Efficient memory management makes development fast and reduces latency between operations. The following exercise shows the significant difference data.table makes in computation time compared to data.frame. First, we read the complete Amazon Food Review dataset, which is close to 286 MB and contains more than half a million records (quite a large dataset for R), using the fread() method, one of the fast reading functions from data.table.
Exercise 8: Exploring the File Read Operation
In this exercise, we will only show the file read operation. You are encouraged to test the other functionality (https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) and compare data.table's capabilities with those of DataFrame.
Perform the following steps to complete the exercise:
- First, load the data.table package using the following command:
library(data.table)
- Read the dataset using the fread() method of the data.table package:
system.time(fread("Reviews_Full.csv"))
The output is as follows:
Read 14.1% of 568454 rows
Read 31.7% of 568454 rows
Read 54.5% of 568454 rows
Read 72.1% of 568454 rows
Read 79.2% of 568454 rows
Read 568454 rows and 10 (of 10) columns from 0.280 GB file in 00:00:08
## user system elapsed
## 3.62 0.15 3.78
- Now, read the same CSV file using the read.csv() method of the base package:
system.time(read.csv("Reviews_Full.csv"))
The output is as follows:
## user system elapsed
## 4.84 0.05 4.91
Observe that reading the file with fread() took 3.78 seconds of elapsed time, while read.csv() took 4.91 seconds, making fread() roughly 30% faster. As the size of the data increases, this difference becomes even more significant.
In the previous output, the user time is the CPU time spent by the R session itself, and the system time is the CPU time spent by the operating system on behalf of that session. You may get different values when executing the system.time() method, even with the same dataset; the result depends heavily on how busy your CPU is at the time of running the method. We should therefore read system.time() output in terms of the relative comparison we are carrying out rather than the absolute values.
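As a minimal illustration of the distinction, sleeping for two seconds consumes almost no CPU, so user and system stay near zero while elapsed is roughly two seconds:
# Wall-clock time passes, but the R process does almost no CPU work
system.time(Sys.sleep(2))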
When a dataset is very large, we also tend to create many intermediate objects on the way to the final output. Keep in mind, however, that data.table is not a magic wand that lets us handle a dataset of any size in R: the amount of RAM still plays a significant role, and data.table is no substitute for distributed and parallel big data processing systems. Nevertheless, even for smaller datasets, data.table often performs much better than data.frame.