Chapter 8 Big data
It is well accepted that the meaning of big data is context specific and evolves over time (e.g., De Mauro et al., 2016; Fox, 2018). In his 1998 talk at the USENIX conference, John Mashey discussed the problems and future of big data (Mashey, 1998). At that time, a desktop with 256MB of random-access memory (RAM; also called physical memory or just memory) and a 16GB hard drive (1GB = 1000MB or 1024MB, depending on the convention; a hard drive is also called a disk drive, or just a disk) was considered a "monster" machine costing more than $3,000. In July 2018, a laptop with 8GB of RAM and a 1000GB hard drive could cost less than $400 (e.g., a Lenovo laptop with an AMD A12-9720P processor). Clearly, a "big" data set in 1998 could be a "small" data set in 2018. Thus, big data should not be defined simply by the size of the data; rather, big data are "data sets that are so big and complex that traditional data-processing application software is inadequate to deal with them" (Wikipedia).
R performs all of its operations directly in RAM and is therefore not well suited to data larger than about 10-20% of a computer's RAM (R manual). Moreover, memory usage can increase dramatically during calculations, because R often makes temporary copies of the objects it works on.
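To see why memory use can balloon during calculations, note that R copies objects on modification; the base function tracemem() reports when such copies happen. A minimal sketch:
x <- rnorm(1e6)   # a vector of about 8 MB
tracemem(x)       # start tracking copies of this vector
y <- x            # no copy yet: x and y share the same memory
y[1] <- 0         # modifying y triggers a full copy, which tracemem reports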
8.1 Memory management in R
When R starts, it sets up a workspace whose size changes with the data and operations. Everything in the workspace is an R object, whether a number, a vector, or a matrix. Objects are either fixed-size or variable-size; R tracks them as Ncells (fixed-size objects, such as language elements) and Vcells (variable-size objects, such as the contents of vectors).
For example, the function gc() reports memory-usage information. The last column ("max used") shows the maximum space used so far; calling gc(reset=TRUE) resets these maxima to the current usage.
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 12605253 673.2 31275138 1670.3 31275138 1670.3
## Vcells 2469717191 18842.5 3842207646 29313.8 3842207009 29313.8
gc(reset=TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 12604823 673.2 31275138 1670.3 12604823 673.2
## Vcells 2469716482 18842.5 3842207646 29313.8 2469716482 18842.5
The R function object.size can be used to calculate the size of an R object.
a <- rnorm(10000)
object.size(a)
## 80048 bytes
format(object.size(a), 'Kb')
## [1] "78.2 Kb"
format(object.size(a), 'Mb')
## [1] "0.1 Mb"
We can also log the use of memory in R, which is called memory profiling. This can be done using the R package profmem.
library(profmem)
profmem({
  a <- array(rnorm(1000), dim=c(100,10))
  b <- array(rnorm(1000), dim=c(100,10))
  a + b
})
## Rprofmem memory profiling of:
## {
## a <- array(rnorm(1000), dim = c(100, 10))
## b <- array(rnorm(1000), dim = c(100, 10))
## a + b
## }
##
## Memory allocations:
## what bytes calls
## 1 alloc 8048 array() -> rnorm()
## 2 alloc 2552 array() -> rnorm()
## 3 alloc 8048 array()
## 4 alloc 8048 array() -> rnorm()
## 5 alloc 2552 array() -> rnorm()
## 6 alloc 8048 array()
## 7 alloc 8048 <internal>
## total 45344
8.2 Investigating the memory usage of R
We first define a function to output the size of an R object.
# Report the size of an R object in human-readable units
msize <- function(x) {
  format(object.size(x), units = "auto")
}
Now, let’s generate a data set with 1,000 subjects and 100 variables. The data set takes about 781.5 Kb of memory.
N <- 1000
P <- 100
y <- array(rnorm(N*P), dim=c(N,P))
msize(y)
## [1] "781.5 Kb"
If we are interested in the variance-covariance matrix of the data, we will have a 100 by 100 matrix that takes about 78.3 Kb of memory.
cov.y <- cov(y)
msize(cov.y)
## [1] "78.3 Kb"
The covariance matrix is symmetric, so it contains 100*(100+1)/2 = 5,050 distinct estimates. In statistical inference, we may need the covariance matrix of these 5,050 values, which is itself a 5,050 by 5,050 matrix. That matrix is related to the outer product of the sample covariance matrix with itself; the full outer product computed below is a 100 by 100 by 100 by 100 array and already takes 762.9 Mb of memory.
cov.cov.y <- cov.y %o% cov.y
msize(cov.cov.y)
## [1] "762.9 Mb"
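This size is easy to verify by hand: the outer product is a 100 by 100 by 100 by 100 array of 8-byte doubles.
100^4 * 8 / 2^20   # = 762.94, i.e., about 762.9 Mb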
Now consider another example. In a study, I have data on 45 variables from more than 9 million participants. Since I cannot share the data, I generate a comparable data set instead, using the following code.
x <- array(rnorm(9300000*44), dim=c(9300000, 44))  # 44 predictors
y <- x %*% c(0.8, 0.5, rep(0, 42))                 # only x1 and x2 have non-zero effects
y <- y + rnorm(9300000)                            # add residual noise
dset <- cbind(y, x)
write.table(round(dset, 3), "bigdata1.txt", quote='', col.names=F, row.names=F)
The data file itself takes 2.49Gb of disk space on the hard drive. Reading the whole data set into R took about 125 seconds, and the resulting R object takes 3.1 Gb of memory.
system.time(bigdata <- read.table('D:/data/bigdata1.txt'))
## user system elapsed
## 120.02 4.47 124.86
names(bigdata) <- c('y', paste0('x', 1:44))
msize(bigdata)
## [1] "3.1 Gb"
Since a file can be very large and thus difficult to read all at once or even to open, it is often useful to look at just the first part of the data. This can be done using the R function readLines, which reads a file line by line.
# read the first 10 lines of the file
readLines('D:/data/bigdata1.txt', n=10)
## [1] "0.078 0.758 -0.694 -0.947 -2.581 -0.149 -0.2 1.964 0.326 0.992 0.881 1.62 2.207 0.372 0.38 1.351 1.971 0.474 0.03 0.959 -0.675 -0.024 -0.102 -1.448 -1.158 0.01 -1.184 -0.572 0.52 -0.608 -0.547 0.598 -0.567 1.471 -1.067 -0.452 0.916 0.379 -0.943 -1.19 -0.086 -1.576 -0.6 1.452 0.189"
## [2] "2.698 2.123 0.131 -0.918 1.051 -0.701 -0.302 1.344 0.328 -0.49 -0.837 -0.658 0.95 0.123 1.191 -0.272 0.086 1.179 0.707 -0.891 -1.135 1.082 -0.332 -0.474 -0.974 -1.049 0.685 -0.203 -1.67 -0.783 -1.205 -1.23 0.5 -0.522 -0.007 1.652 0.104 -0.489 -0.823 -0.013 0.324 0.257 1.557 -0.645 0.619"
## [3] "-0.504 0.357 -1.48 -0.426 -1.107 1.407 -0.847 1.979 0.23 0.673 -0.601 0.242 1.543 0.206 0.241 -0.095 0.432 -0.243 0.265 2.195 2.175 -0.394 1.12 -0.834 2.391 2.097 -1.66 1.927 -0.349 -0.882 -0.1 0.382 -1.084 0.136 0.331 0.413 0.103 -0.609 0.964 1.03 -0.007 0.879 -0.585 -0.194 0.485"
## [4] "-0.886 -2.327 -0.48 1.08 1.882 -0.323 -0.913 -0.307 0.594 0.397 -0.522 -0.312 -3.415 -0.882 0.653 1.569 -1.102 0.391 0.364 -0.396 0.822 1.373 -1.385 1.044 1.144 -3.073 1.086 -0.516 0.31 -0.425 2.31 0.366 -0.739 0.434 -1.526 -0.196 -0.982 -1.372 -2.034 -0.48 0.102 -2.111 0.036 1.363 0.696"
## [5] "2.454 0.216 1.52 0.504 -0.471 -0.04 -0.333 0.446 0.105 0.041 -1.003 -0.684 -0.877 1.505 -1.629 -0.786 0.789 -0.139 0.798 1.873 -0.223 -1.333 0.846 -0.604 -2.525 0.082 -0.675 -0.807 -0.939 -0.258 -0.024 0.141 -0.008 0.897 0.661 -0.503 1.255 -1.544 1.516 -0.702 0.281 -0.85 0.171 -0.182 -1.084"
## [6] "1.52 0.877 0.769 -0.747 -0.044 -2.161 -0.106 -0.936 0.631 2.725 -0.075 1.933 0.722 0.68 -1.868 -2.974 -0.279 -1.296 1.544 0.992 0.385 0.222 0.13 0.757 0.011 0.187 -1.113 -1.004 0.104 0.542 0.393 -1.25 -0.277 0.155 0.188 -0.271 0.623 0.226 0.786 2.519 -0.126 -0.134 2.008 -0.168 -0.428"
## [7] "-1.585 -1.163 0.143 0.218 -0.964 -0.767 1.322 0.884 1.729 0.266 -0.149 -2.203 -0.095 0.223 -0.822 1.915 -2.257 0.972 1.799 1.462 0.228 0.828 -1.316 0.17 -1.131 0.917 0.81 1.576 -1.008 -1.492 1.514 -0.711 -0.591 1.267 -0.917 -0.729 0.891 -0.476 -1.021 0.868 1.45 1.482 1.648 -0.551 -2.253"
## [8] "-1.961 -2.201 -0.469 0.621 -1.164 0.163 -1.664 -2.007 -0.743 -0.755 1.337 0.132 0.96 0.588 -0.787 1.223 2.55 0.472 -0.299 1.223 0.142 -1.267 2.004 0.144 0.819 1.792 -0.647 -0.887 0.399 -0.31 0.828 0.339 0.036 -0.77 -0.809 -1.145 0.238 -1.091 0.844 -0.105 -0.159 -1.74 0.029 0.462 2.003"
## [9] "3.825 1.77 2.539 -2.003 0.648 -2.177 1.254 -0.297 0.919 -0.323 -1.848 0.515 0.009 0.726 -0.912 -2.591 -0.341 -1.609 1.289 0.66 1.33 0.297 0.057 0.489 0.589 -0.228 -0.59 0.763 -0.904 -1.477 2.413 -0.822 -0.695 -0.266 -0.737 -0.036 -0.67 1.591 -1.311 -0.188 -0.22 -0.416 -1.697 -0.433 -1.031"
## [10] "0.656 1.849 -2.221 -0.062 2.957 1.688 -1.318 1.223 0.747 -1.027 1.256 -0.16 0.842 1.61 -0.244 -0.435 -0.548 -0.352 0.885 0.046 1.796 -1.782 1.647 0.67 0.746 0.64 2.618 -0.037 0.506 -2.157 0.307 0.543 0.379 -0.31 -1.676 -0.052 -2.249 1.828 1.473 -0.445 -1.338 -0.812 0.493 1.182 0.374"
We can now fit a regression model to the data.
gc(reset = TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 12604862 673.2 31275138 1670.3 12604862 673.2
## Vcells 2469819313 18843.3 3842207646 29313.8 2469819313 18843.3
system.time(lm.model <- lm(y ~ ., data=bigdata))
## user system elapsed
## 31.28 5.81 37.10
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3305513 176.6 25020110 1336.3 12622298 674.2
## Vcells 2441919986 18630.4 4610729175 35177.1 3841579592 29309.0
summary(lm.model)
##
## Call:
## lm(formula = y ~ ., data = bigdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5016 -0.6746 0.0000 0.6746 4.9972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.661e-04 3.278e-04 -0.812 0.41695
## x1 8.003e-01 3.279e-04 2440.811 < 2e-16 ***
## x2 4.999e-01 3.278e-04 1525.104 < 2e-16 ***
## x3 -4.739e-04 3.279e-04 -1.445 0.14838
## x4 -3.020e-04 3.279e-04 -0.921 0.35701
## x5 -4.545e-04 3.278e-04 -1.387 0.16558
## x6 -1.948e-04 3.278e-04 -0.594 0.55242
## x7 -2.488e-04 3.278e-04 -0.759 0.44785
## x8 3.362e-04 3.278e-04 1.026 0.30506
## x9 4.538e-04 3.278e-04 1.384 0.16632
## x10 -5.771e-04 3.278e-04 -1.760 0.07833 .
## x11 -1.368e-04 3.278e-04 -0.417 0.67647
## x12 -1.840e-04 3.279e-04 -0.561 0.57464
## x13 3.161e-04 3.278e-04 0.964 0.33488
## x14 -2.598e-04 3.277e-04 -0.793 0.42797
## x15 -3.631e-04 3.279e-04 -1.107 0.26811
## x16 1.382e-04 3.278e-04 0.422 0.67337
## x17 2.026e-04 3.279e-04 0.618 0.53661
## x18 5.563e-05 3.278e-04 0.170 0.86526
## x19 3.374e-05 3.279e-04 0.103 0.91805
## x20 -9.403e-04 3.279e-04 -2.867 0.00414 **
## x21 -1.072e-04 3.277e-04 -0.327 0.74359
## x22 5.049e-04 3.279e-04 1.540 0.12365
## x23 -2.200e-04 3.278e-04 -0.671 0.50212
## x24 4.885e-04 3.278e-04 1.490 0.13610
## x25 -1.226e-04 3.278e-04 -0.374 0.70846
## x26 -4.829e-05 3.278e-04 -0.147 0.88287
## x27 -5.505e-04 3.278e-04 -1.680 0.09302 .
## x28 7.486e-05 3.279e-04 0.228 0.81940
## x29 -6.634e-04 3.279e-04 -2.023 0.04308 *
## x30 -1.978e-04 3.278e-04 -0.604 0.54617
## x31 7.914e-04 3.278e-04 2.414 0.01576 *
## x32 -1.750e-04 3.279e-04 -0.534 0.59352
## x33 9.689e-05 3.279e-04 0.295 0.76764
## x34 -1.413e-04 3.278e-04 -0.431 0.66649
## x35 5.080e-04 3.278e-04 1.550 0.12123
## x36 -2.867e-04 3.280e-04 -0.874 0.38215
## x37 -2.032e-04 3.278e-04 -0.620 0.53528
## x38 -1.623e-04 3.277e-04 -0.495 0.62039
## x39 2.997e-04 3.278e-04 0.914 0.36056
## x40 3.403e-06 3.278e-04 0.010 0.99172
## x41 2.188e-04 3.279e-04 0.667 0.50448
## x42 1.713e-04 3.279e-04 0.522 0.60140
## x43 8.022e-05 3.279e-04 0.245 0.80676
## x44 7.804e-04 3.278e-04 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9998 on 9299955 degrees of freedom
## Multiple R-squared: 0.471, Adjusted R-squared: 0.471
## F-statistic: 1.882e+05 on 44 and 9299955 DF, p-value: < 2.2e-16
The whole analysis took about 37 seconds. Peak memory usage (the "max used" column of Vcells) rose from about 18.4Gb before the call to about 28.6Gb afterwards, so the analysis itself used roughly 10Gb of additional memory. The fitted object, lm.model, alone takes about 8.2Gb of memory. On a computer without enough memory, the analysis would fail with an error.
msize(lm.model)
## [1] "8.2 Gb"
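Much of this space is taken by components stored inside the lm object, such as the model frame, fitted values, residuals, and the QR decomposition. If only the coefficient estimates are needed, some of these can be omitted at fitting time (a sketch; note that dropping the QR component breaks summary() and related methods):
# model = FALSE drops the stored copy of the data (the model frame);
# qr = FALSE drops the QR decomposition, at the cost of some functionality
slim.model <- lm(y ~ ., data = bigdata, model = FALSE, qr = FALSE)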
8.3 Ways of handling big data
Many different approaches can be used to handle big data. For example, one can use more powerful computers, with more memory and faster CPUs. One can also use better algorithms that require less memory. Here we consider three methods: (1) using sparse matrices, (2) using file-backed memory management, and (3) using a divide-and-conquer algorithm.
8.3.1 Use of the sparse matrix
In general, a matrix with many zeros is called a sparse matrix; otherwise, it is a dense matrix. For example, the matrix below contains only 9 nonzero elements and 26 zero elements, so its sparsity is 26/35 = 74% and its density is 9/35 = 26%.
mat1 <- array(c(0,11,0,0,0, 22,33,0,0,0, 0,44,55,0,0,
                0,0,66,0,0, 0,0,77,0,0, 0,0,0,88,0,
                0,0,0,0,99), dim=c(5,7))
mat1
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    0   22    0    0    0    0    0
## [2,]   11   33   44    0    0    0    0
## [3,]    0    0   55   66   77    0    0
## [4,]    0    0    0    0    0   88    0
## [5,]    0    0    0    0    0    0   99
For a sparse matrix, we can usually store the information more efficiently, for example in coordinate form: three vectors, two holding the row and column indices of the non-zero elements and the third holding their values. This is also referred to as triplet form. For the matrix above, we can create a sparse representation with the Matrix package; the str() output below shows the internal storage, and summary() displays the matrix in triplet form.
library(Matrix)
mat2 <- Matrix(mat1, sparse=TRUE)
str(mat2)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:9] 1 0 1 1 2 2 2 3 4
## ..@ p : int [1:8] 0 1 3 5 6 7 8 9
## ..@ Dim : int [1:2] 5 7
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ x : num [1:9] 11 22 33 44 55 66 77 88 99
## ..@ factors : list()
summary(mat2)
## 5 x 7 sparse Matrix of class "dgCMatrix", with 9 entries
## i j x
## 1 2 1 11
## 2 1 2 22
## 3 2 2 33
## 4 2 3 44
## 5 3 3 55
## 6 3 4 66
## 7 3 5 77
## 8 4 6 88
## 9 5 7 99
The str() output reveals that the Matrix package actually stores the matrix in another format, called compressed sparse column (CSC) format, which also represents a matrix by three vectors. One vector (the slot x) contains all the non-zero values. A second (the slot i) contains the row index of each non-zero element. The third (the slot p) has length equal to the number of columns plus one; its first element is always 0, and each subsequent value equals the previous value plus the number of non-zero values in that column. This is the dgCMatrix class created by the R package Matrix.
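As a quick check of this description, the p slot is cumulative, so differencing it recovers the number of non-zero values in each column:
diff(mat2@p)   # 1 2 2 1 1 1 1: non-zero counts for the 7 columns of mat1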
For a matrix with p rows and q columns and K non-zero values, dense storage holds pq numbers while the coordinate form holds roughly 3K, so the coordinate form saves memory when K < pq/3. In general, the use of a sparse matrix is more efficient for big matrices.
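As a rough empirical check, we can simulate random sparse matrices at several densities with rsparsematrix() from the Matrix package (a sketch; the dgCMatrix format stores about 12 bytes per non-zero versus 8 bytes per element in dense storage, so its break-even density is somewhat higher than 1/3):
library(Matrix)
# Compare sparse and dense storage as the density of non-zeros increases;
# the sparse size grows with the non-zero count, the dense size is fixed
for (dens in c(0.1, 0.3, 0.5)) {
  m <- rsparsematrix(1000, 1000, density = dens)
  cat("density", dens, "- sparse:", format(object.size(m), "Mb"),
      "dense:", format(object.size(as.matrix(m)), "Mb"), "\n")
}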
For a quick comparison, we compare the base R function matrix() with the Matrix() function from the Matrix package.
m1 <- matrix(0, nrow = 1000, ncol = 1000)
m2 <- Matrix(0, nrow = 1000, ncol = 1000, sparse = TRUE)
msize(m1)
## [1] "7.6 Mb"
msize(m2)
## [1] "5.6 Kb"
Note that even an all-zero matrix stored in the dense base-R representation uses over a thousand times more memory than its sparse counterpart. We can further check how the sizes change when entries become non-zero, first a single element and then the whole diagonal:
m1[500, 500] <- 1
m2[500, 500] <- 1
diag(m1) <- 1
diag(m2) <- 1
msize(m1)
## [1] "7.6 Mb"
msize(m2)
## [1] "17.3 Kb"
The dense representation does not change in size because all of its zeros were already stored explicitly, while the sparse matrix grows only with the number of non-zero entries it must store.
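As a quick check, nnzero() from the Matrix package counts the stored non-zero entries:
nnzero(m2)   # 1000 stored entries: the diagonal (the [500, 500] entry lies on it)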
8.3.2 An example
The data set bigdata2.txt includes data from 9,300,000 subjects on 45 variables. Many of its values are exactly zero, which makes it well suited to sparse storage.
system.time(bigdata2 <- read.table('D:/data/bigdata2.txt'))
## user system elapsed
## 90.95 7.11 98.41
bigdata2 <- as.matrix(bigdata2)
msize(bigdata2)
## [1] "3.1 Gb"
We can change this to a sparse matrix.
bigdata2.sparse <- Matrix(bigdata2, sparse = T)
msize(bigdata2.sparse)
## [1] "340.4 Mb"
Not all R functions can work directly with a sparse matrix. For example, lm() does not accept one, so the following call fails:
m2.fit <- lm(bigdata2.sparse[,1] ~ bigdata2.sparse[, 2:45])
But some R packages can take advantage of a sparse matrix. For example, the R package glmnet can be used to fit a regression model; with lambda=0, the penalty vanishes and glmnet fits an ordinary (unpenalized) linear regression.
library(glmnet)
m2.fit <- glmnet(bigdata2.sparse[, 2:45], bigdata2.sparse[,1], lambda=0)
coef(m2.fit)
## 45 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3.098768e-04
## V2 8.024610e-01
## V3 5.001773e-01
## V4 1.075327e-03
## V5 -5.854417e-04
## V6 8.307421e-04
## V7 -2.621908e-03
## V8 6.415586e-04
## V9 -9.141844e-04
## V10 -1.176226e-03
## V11 -1.183520e-03
## V12 4.622483e-04
## V13 1.741654e-03
## V14 -1.482702e-03
## V15 -1.896171e-03
## V16 -1.451331e-04
## V17 8.380330e-04
## V18 -2.516776e-03
## V19 -7.393954e-04
## V20 -2.009022e-03
## V21 -2.782236e-03
## V22 8.836950e-04
## V23 1.107031e-03
## V24 -1.866159e-03
## V25 3.227316e-05
## V26 -7.867384e-04
## V27 1.069368e-03
## V28 -3.771603e-03
## V29 1.294955e-03
## V30 4.267471e-04
## V31 2.350611e-04
## V32 3.168892e-03
## V33 4.555078e-04
## V34 1.212845e-03
## V35 -2.297700e-03
## V36 -2.296558e-04
## V37 -4.655091e-04
## V38 -1.637043e-03
## V39 8.900071e-04
## V40 5.755523e-04
## V41 -3.812868e-04
## V42 -6.247250e-04
## V43 1.154913e-03
## V44 2.159603e-03
## V45 -8.821923e-04
8.3.3 File-backed memory management - Memory mapping
Memory mapping is based on the use of virtual memory, which can be much larger than a computer's RAM. Memory mapping associates a segment of virtual memory with a data file on the hard drive. In this way, the entire file does not need to be read into RAM; only the part of the data that is immediately needed is loaded, and after it is used it is released from RAM so that another part can be loaded. This method is illustrated in the figure below. Note that the virtual memory can be as big as the disk space.
The R package bigmemory (Kane et al., 2013) uses this idea for data management. The package defines a new data structure called big.matrix for numeric matrices, which uses memory-mapped files to allow matrices to exceed the RAM size. The underlying technology, memory mapping on modern operating systems, associates a segment of virtual memory in one-to-one correspondence with the contents of a file. These files are accessed much faster than in database approaches because operations are handled at the operating-system level.
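As a minimal sketch of the mechanism (the file names here are illustrative), a file-backed big.matrix can also be created directly, without reading a text file:
library(bigmemory)
# The matrix lives in the backing file, not in RAM; reads and writes go
# through the operating system's memory map
bm <- filebacked.big.matrix(1000, 10, type = "double",
                            backingfile = "demo.back",
                            descriptorfile = "demo.desc",
                            backingpath = tempdir())
bm[1, ] <- rnorm(10)   # write one row through the memory map
bm[1, 1:3]             # read it back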
The function read.big.matrix can be used to read the data. A big.matrix holds a single type of data, one of double, integer, short (a 2-byte integer), or char. The function returns a pointer to the file-backed matrix on disk; therefore, the R object itself is extremely small.
b.names <- c('y', paste0('x', 1:44))
system.time(bigdata <- read.big.matrix('D:/data/bigdata1.txt', header=FALSE, type="double", sep=" ",
col.names=b.names, backingpath = "D:/data/", backingfile="bigdata.back",
descriptorfile="bigdata.desc"))
## user system elapsed
## 333.81 12.36 457.55
msize(bigdata)
## [1] "696 bytes"
describe(bigdata)
## An object of class "big.matrix.descriptor"
## Slot "description":
## $sharedType
## [1] "FileBacked"
##
## $filename
## [1] "bigdata.back"
##
## $dirname
## [1] "D:/data//"
##
## $totalRows
## [1] 9300000
##
## $totalCols
## [1] 45
##
## $rowOffset
## [1] 0 9300000
##
## $colOffset
## [1] 0 45
##
## $nrow
## [1] 9300000
##
## $ncol
## [1] 45
##
## $rowNames
## NULL
##
## $colNames
## [1] "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10" "x11"
## [13] "x12" "x13" "x14" "x15" "x16" "x17" "x18" "x19" "x20" "x21" "x22" "x23"
## [25] "x24" "x25" "x26" "x27" "x28" "x29" "x30" "x31" "x32" "x33" "x34" "x35"
## [37] "x36" "x37" "x38" "x39" "x40" "x41" "x42" "x43" "x44"
##
## $type
## [1] "double"
##
## $separated
## [1] FALSE
Many basic matrix operations can be applied directly to the big.matrix object.
dim(bigdata)
## [1] 9300000 45
head(bigdata)
## y x1 x2 x3 x4 x5 x6 x7 x8 x9
## [1,] 0.078 0.758 -0.694 -0.947 -2.581 -0.149 -0.200 1.964 0.326 0.992
## [2,] 2.698 2.123 0.131 -0.918 1.051 -0.701 -0.302 1.344 0.328 -0.490
## [3,] -0.504 0.357 -1.480 -0.426 -1.107 1.407 -0.847 1.979 0.230 0.673
## [4,] -0.886 -2.327 -0.480 1.080 1.882 -0.323 -0.913 -0.307 0.594 0.397
## [5,] 2.454 0.216 1.520 0.504 -0.471 -0.040 -0.333 0.446 0.105 0.041
## [6,] 1.520 0.877 0.769 -0.747 -0.044 -2.161 -0.106 -0.936 0.631 2.725
## x10 x11 x12 x13 x14 x15 x16 x17 x18 x19
## [1,] 0.881 1.620 2.207 0.372 0.380 1.351 1.971 0.474 0.030 0.959
## [2,] -0.837 -0.658 0.950 0.123 1.191 -0.272 0.086 1.179 0.707 -0.891
## [3,] -0.601 0.242 1.543 0.206 0.241 -0.095 0.432 -0.243 0.265 2.195
## [4,] -0.522 -0.312 -3.415 -0.882 0.653 1.569 -1.102 0.391 0.364 -0.396
## [5,] -1.003 -0.684 -0.877 1.505 -1.629 -0.786 0.789 -0.139 0.798 1.873
## [6,] -0.075 1.933 0.722 0.680 -1.868 -2.974 -0.279 -1.296 1.544 0.992
## x20 x21 x22 x23 x24 x25 x26 x27 x28 x29
## [1,] -0.675 -0.024 -0.102 -1.448 -1.158 0.010 -1.184 -0.572 0.520 -0.608
## [2,] -1.135 1.082 -0.332 -0.474 -0.974 -1.049 0.685 -0.203 -1.670 -0.783
## [3,] 2.175 -0.394 1.120 -0.834 2.391 2.097 -1.660 1.927 -0.349 -0.882
## [4,] 0.822 1.373 -1.385 1.044 1.144 -3.073 1.086 -0.516 0.310 -0.425
## [5,] -0.223 -1.333 0.846 -0.604 -2.525 0.082 -0.675 -0.807 -0.939 -0.258
## [6,] 0.385 0.222 0.130 0.757 0.011 0.187 -1.113 -1.004 0.104 0.542
## x30 x31 x32 x33 x34 x35 x36 x37 x38 x39
## [1,] -0.547 0.598 -0.567 1.471 -1.067 -0.452 0.916 0.379 -0.943 -1.190
## [2,] -1.205 -1.230 0.500 -0.522 -0.007 1.652 0.104 -0.489 -0.823 -0.013
## [3,] -0.100 0.382 -1.084 0.136 0.331 0.413 0.103 -0.609 0.964 1.030
## [4,] 2.310 0.366 -0.739 0.434 -1.526 -0.196 -0.982 -1.372 -2.034 -0.480
## [5,] -0.024 0.141 -0.008 0.897 0.661 -0.503 1.255 -1.544 1.516 -0.702
## [6,] 0.393 -1.250 -0.277 0.155 0.188 -0.271 0.623 0.226 0.786 2.519
## x40 x41 x42 x43 x44
## [1,] -0.086 -1.576 -0.600 1.452 0.189
## [2,] 0.324 0.257 1.557 -0.645 0.619
## [3,] -0.007 0.879 -0.585 -0.194 0.485
## [4,] 0.102 -2.111 0.036 1.363 0.696
## [5,] 0.281 -0.850 0.171 -0.182 -1.084
## [6,] -0.126 -0.134 2.008 -0.168 -0.428
tail(bigdata)
## y x1 x2 x3 x4 x5 x6 x7 x8 x9
## [1,] 0.729 -0.339 0.661 -2.108 -0.216 -1.969 -0.513 -1.534 1.916 0.128
## [2,] 3.097 0.852 0.994 -0.845 -0.890 -1.676 0.501 -0.982 0.352 0.180
## [3,] -2.146 -0.734 -1.307 -1.279 0.263 -0.522 1.943 -0.078 0.315 1.020
## [4,] -0.429 -0.912 -0.266 -0.498 0.268 0.308 2.106 -0.485 -0.725 0.087
## [5,] 0.076 0.249 0.638 -2.386 0.982 0.436 -1.811 0.944 0.863 1.285
## [6,] -0.589 -1.130 -1.445 -2.108 -0.604 -0.339 -0.428 0.211 1.997 -0.872
## x10 x11 x12 x13 x14 x15 x16 x17 x18 x19
## [1,] -0.982 0.252 0.171 0.983 -0.562 2.201 -0.767 0.377 0.023 -1.258
## [2,] -1.083 1.152 0.662 -1.605 -0.038 1.721 0.244 -1.419 -1.391 1.383
## [3,] -0.036 -0.152 -1.717 1.417 -0.110 -0.681 -0.365 -1.846 0.115 0.998
## [4,] 1.295 1.012 -0.369 -0.538 -0.990 0.702 -1.941 -0.385 0.070 -0.366
## [5,] 0.404 1.326 0.882 1.211 0.090 -2.496 1.409 -1.835 -0.706 -1.323
## [6,] -0.206 1.149 1.074 0.756 0.425 0.368 -0.372 0.420 -0.182 -0.825
## x20 x21 x22 x23 x24 x25 x26 x27 x28 x29
## [1,] 0.132 0.672 -1.082 0.548 -1.142 -1.172 -0.339 -0.359 -0.717 0.565
## [2,] 1.152 1.036 0.164 -1.465 1.693 0.049 0.100 0.652 0.154 -0.754
## [3,] -1.378 -0.021 0.454 -0.660 1.496 0.124 -0.746 -2.241 0.104 -0.916
## [4,] 1.422 1.110 1.276 0.106 0.207 0.670 0.174 0.430 0.195 -0.206
## [5,] -0.002 0.633 0.264 -1.761 -0.850 0.375 0.532 0.495 0.568 0.707
## [6,] -0.089 1.030 -0.477 0.164 0.512 0.743 0.549 0.029 -0.404 -0.908
## x30 x31 x32 x33 x34 x35 x36 x37 x38 x39
## [1,] 1.115 2.267 0.990 -1.244 -1.005 0.679 0.437 -1.957 -1.598 0.586
## [2,] 0.208 -0.981 -1.784 -0.096 -0.461 -0.079 -1.562 -2.081 1.325 -1.201
## [3,] 0.536 1.913 -2.255 -0.083 -1.504 1.194 -1.493 -0.283 -1.039 0.147
## [4,] -1.599 0.977 -0.139 0.613 -0.264 0.498 1.153 -1.270 -2.702 -0.960
## [5,] -1.196 0.815 0.402 -1.129 1.084 0.315 -0.153 -0.317 0.123 1.279
## [6,] -0.289 0.109 0.364 0.190 -0.362 0.040 -0.753 0.482 0.660 -1.514
## x40 x41 x42 x43 x44
## [1,] 0.058 -1.502 -1.122 -0.306 2.818
## [2,] -0.356 -0.505 -1.318 0.301 -0.873
## [3,] 0.111 0.064 -0.531 0.877 0.621
## [4,] -0.217 0.721 -1.785 -0.684 -0.007
## [5,] -0.578 -0.241 -1.507 -2.037 -0.404
## [6,] -0.748 0.213 -0.320 0.172 -0.956
summary(bigdata)
## min max mean NAs
## y -6.986000e+00 6.772000e+00 -8.578753e-05 0.000000e+00
## x1 -5.203000e+00 4.978000e+00 1.032937e-04 0.000000e+00
## x2 -5.543000e+00 5.328000e+00 1.940722e-04 0.000000e+00
## x3 -5.152000e+00 5.002000e+00 4.110796e-04 0.000000e+00
## x4 -5.264000e+00 5.042000e+00 -3.704714e-04 0.000000e+00
## x5 -5.150000e+00 5.489000e+00 4.879371e-04 0.000000e+00
## x6 -5.218000e+00 5.176000e+00 3.849446e-04 0.000000e+00
## x7 -5.313000e+00 5.072000e+00 1.582546e-04 0.000000e+00
## x8 -4.980000e+00 5.306000e+00 -1.795616e-04 0.000000e+00
## x9 -5.335000e+00 5.447000e+00 -1.382226e-05 0.000000e+00
## x10 -5.328000e+00 5.118000e+00 -4.109939e-04 0.000000e+00
## x11 -4.945000e+00 4.991000e+00 1.364146e-04 0.000000e+00
## x12 -5.518000e+00 5.150000e+00 -2.684892e-04 0.000000e+00
## x13 -5.211000e+00 5.338000e+00 -3.417510e-04 0.000000e+00
## x14 -5.496000e+00 5.679000e+00 -1.890670e-04 0.000000e+00
## x15 -5.423000e+00 5.223000e+00 -1.623486e-04 0.000000e+00
## x16 -5.700000e+00 5.650000e+00 5.047738e-04 0.000000e+00
## x17 -5.133000e+00 5.235000e+00 -1.196059e-04 0.000000e+00
## x18 -5.071000e+00 4.996000e+00 4.349341e-04 0.000000e+00
## x19 -5.436000e+00 5.314000e+00 -1.393484e-05 0.000000e+00
## x20 -5.657000e+00 4.959000e+00 -2.316873e-04 0.000000e+00
## x21 -5.113000e+00 5.440000e+00 -3.070917e-04 0.000000e+00
## x22 -5.102000e+00 5.376000e+00 2.296577e-04 0.000000e+00
## x23 -4.998000e+00 5.271000e+00 1.018771e-04 0.000000e+00
## x24 -5.439000e+00 5.368000e+00 3.427799e-04 0.000000e+00
## x25 -4.996000e+00 5.060000e+00 8.001932e-04 0.000000e+00
## x26 -5.858000e+00 5.363000e+00 4.890148e-04 0.000000e+00
## x27 -5.434000e+00 5.569000e+00 1.631769e-04 0.000000e+00
## x28 -5.417000e+00 5.465000e+00 -8.578226e-05 0.000000e+00
## x29 -4.922000e+00 5.061000e+00 -1.692409e-05 0.000000e+00
## x30 -5.384000e+00 5.352000e+00 -9.349516e-05 0.000000e+00
## x31 -5.164000e+00 5.440000e+00 -1.631786e-04 0.000000e+00
## x32 -5.180000e+00 5.326000e+00 3.498729e-04 0.000000e+00
## x33 -5.239000e+00 5.015000e+00 2.694239e-04 0.000000e+00
## x34 -5.254000e+00 5.289000e+00 1.387848e-04 0.000000e+00
## x35 -5.249000e+00 5.246000e+00 2.275359e-04 0.000000e+00
## x36 -5.501000e+00 5.337000e+00 -8.486939e-04 0.000000e+00
## x37 -5.070000e+00 5.160000e+00 4.457989e-05 0.000000e+00
## x38 -5.597000e+00 4.912000e+00 -5.178778e-04 0.000000e+00
## x39 -5.336000e+00 5.226000e+00 -2.357135e-04 0.000000e+00
## x40 -5.405000e+00 5.145000e+00 -1.154631e-04 0.000000e+00
## x41 -5.107000e+00 5.251000e+00 -3.038226e-05 0.000000e+00
## x42 -5.432000e+00 5.296000e+00 -3.469746e-04 0.000000e+00
## x43 -4.999000e+00 5.262000e+00 -2.306269e-05 0.000000e+00
## x44 -4.954000e+00 5.032000e+00 4.638249e-04 0.000000e+00
The backing file is created only once. Once the data have been read using read.big.matrix, we can access them again as follows. First, we read the descriptor information from the .desc file. Then, we attach the data in R using the attach.big.matrix function. Note that this takes almost no time, because no data are actually loaded into RAM.
# Read the descriptor from disk.
bigdata.desc <- dget("D:/data/bigdata.desc")
# Attach the file-backed matrix in RAM.
bigdata1 <- attach.big.matrix(bigdata.desc)
The R package biganalytics can be used to fit regression models to big data.
library(biganalytics)
gc(reset = TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 12606600 673.3 25020110 1336.3 12606600 673.3
## Vcells 2051326260 15650.4 4610729175 35177.1 2051326260 15650.4
big.lm.2 <- biglm.big.matrix(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+
x11+x12+x13+x14+x15+x16+x17+x18+x19+
x20+x21+x22+x23+x24+x25+x26+x27+x28+x29+
x30+x31+x32+x33+x34+x35+x36+x37+x38+x39+
x40+x41+x42+x43+x44, data = bigdata)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 12609403 673.5 25020110 1336.3 25020110 1336.3
## Vcells 2051334182 15650.5 4610729175 35177.1 3957173940 30190.9
summary(big.lm.2)
## Large data regression model: biglm(formula = formula, data = data, ...)
## Sample size = 9300000
## Coef (95% CI) SE p
## (Intercept) -0.0003 -0.0009 0.0004 3e-04 0.4169
## x1 0.8003 0.7996 0.8009 3e-04 0.0000
## x2 0.4999 0.4993 0.5006 3e-04 0.0000
## x3 -0.0005 -0.0011 0.0002 3e-04 0.1484
## x4 -0.0003 -0.0010 0.0004 3e-04 0.3570
## x5 -0.0005 -0.0011 0.0002 3e-04 0.1656
## x6 -0.0002 -0.0009 0.0005 3e-04 0.5524
## x7 -0.0002 -0.0009 0.0004 3e-04 0.4479
## x8 0.0003 -0.0003 0.0010 3e-04 0.3051
## x9 0.0005 -0.0002 0.0011 3e-04 0.1663
## x10 -0.0006 -0.0012 0.0001 3e-04 0.0783
## x11 -0.0001 -0.0008 0.0005 3e-04 0.6765
## x12 -0.0002 -0.0008 0.0005 3e-04 0.5746
## x13 0.0003 -0.0003 0.0010 3e-04 0.3349
## x14 -0.0003 -0.0009 0.0004 3e-04 0.4280
## x15 -0.0004 -0.0010 0.0003 3e-04 0.2681
## x16 0.0001 -0.0005 0.0008 3e-04 0.6734
## x17 0.0002 -0.0005 0.0009 3e-04 0.5366
## x18 0.0001 -0.0006 0.0007 3e-04 0.8653
## x19 0.0000 -0.0006 0.0007 3e-04 0.9181
## x20 -0.0009 -0.0016 -0.0003 3e-04 0.0041
## x21 -0.0001 -0.0008 0.0005 3e-04 0.7436
## x22 0.0005 -0.0002 0.0012 3e-04 0.1237
## x23 -0.0002 -0.0009 0.0004 3e-04 0.5021
## x24 0.0005 -0.0002 0.0011 3e-04 0.1361
## x25 -0.0001 -0.0008 0.0005 3e-04 0.7085
## x26 0.0000 -0.0007 0.0006 3e-04 0.8829
## x27 -0.0006 -0.0012 0.0001 3e-04 0.0930
## x28 0.0001 -0.0006 0.0007 3e-04 0.8194
## x29 -0.0007 -0.0013 0.0000 3e-04 0.0431
## x30 -0.0002 -0.0009 0.0005 3e-04 0.5462
## x31 0.0008 0.0001 0.0014 3e-04 0.0158
## x32 -0.0002 -0.0008 0.0005 3e-04 0.5935
## x33 0.0001 -0.0006 0.0008 3e-04 0.7676
## x34 -0.0001 -0.0008 0.0005 3e-04 0.6665
## x35 0.0005 -0.0001 0.0012 3e-04 0.1212
## x36 -0.0003 -0.0009 0.0004 3e-04 0.3821
## x37 -0.0002 -0.0009 0.0005 3e-04 0.5353
## x38 -0.0002 -0.0008 0.0005 3e-04 0.6204
## x39 0.0003 -0.0004 0.0010 3e-04 0.3606
## x40 0.0000 -0.0007 0.0007 3e-04 0.9917
## x41 0.0002 -0.0004 0.0009 3e-04 0.5045
## x42 0.0002 -0.0005 0.0008 3e-04 0.6014
## x43 0.0001 -0.0006 0.0007 3e-04 0.8068
## x44 0.0008 0.0001 0.0014 3e-04 0.0173
8.3.4 The ff and ffbase packages
The ff package works similarly to the bigmemory package but allows a mixture of different data types. The ffbase package extends ff with additional functions.
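For instance, here is a minimal sketch (with made-up values) of an ffdf that mixes column types, something a big.matrix cannot do:
library(ff)
# Integer, double, and factor columns coexist in one file-backed ffdf
d <- as.ffdf(data.frame(age   = c(21L, 35L, 50L),
                        score = c(1.2, 3.4, 5.6),
                        grp   = factor(c("a", "b", "a"))))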
When using ff, some information is saved in a temporary directory.
# Get the temporary directory
getOption("fftempdir")
## [1] "C:/Users/zzhang4/AppData/Local/Temp/RtmpWe2ofB"
# Set new temporary directory
options(fftempdir = "D:/data")
We now read bigdata1.txt again. Note that multiple temporary files are created, one ff file per column.
bigdata2 = read.table.ffdf(file="D:/data/bigdata1.txt", # file name
                           sep=" ",      # values are separated by spaces
                           header=FALSE, # no variable names in the file
                           fill=TRUE)    # missing values are represented by NA
Basic operations can be used as with the bigmemory package.
names(bigdata2) <- c('y', paste0('x', 1:44))
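For example (a quick sketch), standard data-frame operations work, and subsetting pulls only the requested block into RAM:
dim(bigdata2)        # 9300000 rows, 45 columns
bigdata2[1:3, 1:5]   # returns an ordinary data frame holding just this block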
We can save the ff data so that they can be reloaded quickly in the future.
ffsave(bigdata2,
file="D:/data/bigdata1.ff",
rootpath="D:/data")
## [1] " adding: ffdf2600195a3c0c.ff (172 bytes security) (deflated 69%)"
## [2] " adding: ffdf26005754a92.ff (172 bytes security) (deflated 70%)"
## [3] " adding: ffdf26002f6e2b5a.ff (172 bytes security) (deflated 70%)"
## [4] " adding: ffdf260050ba159d.ff (172 bytes security) (deflated 70%)"
## [5] " adding: ffdf260056e14795.ff (172 bytes security) (deflated 70%)"
## [6] " adding: ffdf260067384811.ff (172 bytes security) (deflated 70%)"
## [7] " adding: ffdf260019511d0.ff (172 bytes security) (deflated 70%)"
## [8] " adding: ffdf260066176ea5.ff (172 bytes security) (deflated 70%)"
## [9] " adding: ffdf260078b578a7.ff (172 bytes security) (deflated 70%)"
## [10] " adding: ffdf260037b129ff.ff (172 bytes security) (deflated 70%)"
## [11] " adding: ffdf26004fe54817.ff (172 bytes security) (deflated 70%)"
## [12] " adding: ffdf2600183c1d2a.ff (172 bytes security) (deflated 70%)"
## [13] " adding: ffdf26001c5db93.ff (172 bytes security) (deflated 70%)"
## [14] " adding: ffdf26004daa2045.ff (172 bytes security) (deflated 70%)"
## [15] " adding: ffdf2600153d4653.ff (172 bytes security) (deflated 70%)"
## [16] " adding: ffdf260016cdb78.ff (172 bytes security) (deflated 70%)"
## [17] " adding: ffdf2600339525f0.ff (172 bytes security) (deflated 70%)"
## [18] " adding: ffdf260015bf2e2d.ff (172 bytes security) (deflated 70%)"
## [19] " adding: ffdf2600fe0532e.ff (172 bytes security) (deflated 70%)"
## [20] " adding: ffdf26003981467c.ff (172 bytes security) (deflated 70%)"
## [21] " adding: ffdf260038ce4518.ff (172 bytes security) (deflated 70%)"
## [22] " adding: ffdf26006ae2ce6.ff (172 bytes security) (deflated 70%)"
## [23] " adding: ffdf260067474a53.ff (172 bytes security) (deflated 70%)"
## [24] " adding: ffdf2600a22847.ff (172 bytes security) (deflated 70%)"
## [25] " adding: ffdf260033ac3cac.ff (172 bytes security) (deflated 70%)"
## [26] " adding: ffdf26003ed81e06.ff (172 bytes security) (deflated 70%)"
## [27] " adding: ffdf2600306012f3.ff (172 bytes security) (deflated 70%)"
## [28] " adding: ffdf26006591128f.ff (172 bytes security) (deflated 70%)"
## [29] " adding: ffdf2600d87450c.ff (172 bytes security) (deflated 70%)"
## [30] " adding: ffdf260012ba6e03.ff (172 bytes security) (deflated 70%)"
## [31] " adding: ffdf26002a0d3643.ff (172 bytes security) (deflated 70%)"
## [32] " adding: ffdf26007b40392d.ff (172 bytes security) (deflated 70%)"
## [33] " adding: ffdf260035ea19e0.ff (172 bytes security) (deflated 70%)"
## [34] " adding: ffdf26003f7546b9.ff (172 bytes security) (deflated 70%)"
## [35] " adding: ffdf260049a55fb.ff (172 bytes security) (deflated 70%)"
## [36] " adding: ffdf26005eb67ca0.ff (172 bytes security) (deflated 70%)"
## [37] " adding: ffdf260017172892.ff (172 bytes security) (deflated 70%)"
## [38] " adding: ffdf2600e4b51e7.ff (172 bytes security) (deflated 70%)"
## [39] " adding: ffdf2600177252d9.ff (172 bytes security) (deflated 70%)"
## [40] " adding: ffdf2600778e5489.ff (172 bytes security) (deflated 70%)"
## [41] " adding: ffdf26005a824ed1.ff (172 bytes security) (deflated 70%)"
## [42] " adding: ffdf2600214f4bb2.ff (172 bytes security) (deflated 70%)"
## [43] " adding: ffdf26005b32331c.ff (172 bytes security) (deflated 70%)"
## [44] " adding: ffdf26008784a2d.ff (172 bytes security) (deflated 70%)"
## [45] " adding: ffdf26007d512910.ff (172 bytes security) (deflated 70%)"
To load the data into R, we use
ffload(file="D:/data/bigdata1.ff",
overwrite = TRUE)
## [1] "ffdf2600195a3c0c.ff" "ffdf26005754a92.ff" "ffdf26002f6e2b5a.ff"
## [4] "ffdf260050ba159d.ff" "ffdf260056e14795.ff" "ffdf260067384811.ff"
## [7] "ffdf260019511d0.ff" "ffdf260066176ea5.ff" "ffdf260078b578a7.ff"
## [10] "ffdf260037b129ff.ff" "ffdf26004fe54817.ff" "ffdf2600183c1d2a.ff"
## [13] "ffdf26001c5db93.ff" "ffdf26004daa2045.ff" "ffdf2600153d4653.ff"
## [16] "ffdf260016cdb78.ff" "ffdf2600339525f0.ff" "ffdf260015bf2e2d.ff"
## [19] "ffdf2600fe0532e.ff" "ffdf26003981467c.ff" "ffdf260038ce4518.ff"
## [22] "ffdf26006ae2ce6.ff" "ffdf260067474a53.ff" "ffdf2600a22847.ff"
## [25] "ffdf260033ac3cac.ff" "ffdf26003ed81e06.ff" "ffdf2600306012f3.ff"
## [28] "ffdf26006591128f.ff" "ffdf2600d87450c.ff" "ffdf260012ba6e03.ff"
## [31] "ffdf26002a0d3643.ff" "ffdf26007b40392d.ff" "ffdf260035ea19e0.ff"
## [34] "ffdf26003f7546b9.ff" "ffdf260049a55fb.ff" "ffdf26005eb67ca0.ff"
## [37] "ffdf260017172892.ff" "ffdf2600e4b51e7.ff" "ffdf2600177252d9.ff"
## [40] "ffdf2600778e5489.ff" "ffdf26005a824ed1.ff" "ffdf2600214f4bb2.ff"
## [43] "ffdf26005b32331c.ff" "ffdf26008784a2d.ff" "ffdf26007d512910.ff"
model_formula = as.formula(paste0("y ~", paste0(paste0('x',1:44), collapse="+")))
model_out = bigglm(model_formula, data=bigdata2)
summary(model_out)
## Large data regression model: bigglm(model_formula, data = bigdata2)
## Sample size = 9300000
## Coef (95% CI) SE p
## (Intercept) -0.0003 -0.0009 0.0004 3e-04 0.4169
## x1 0.8003 0.7996 0.8009 3e-04 0.0000
## x2 0.4999 0.4993 0.5006 3e-04 0.0000
## x3 -0.0005 -0.0011 0.0002 3e-04 0.1484
## x4 -0.0003 -0.0010 0.0004 3e-04 0.3570
## x5 -0.0005 -0.0011 0.0002 3e-04 0.1656
## x6 -0.0002 -0.0009 0.0005 3e-04 0.5524
## x7 -0.0002 -0.0009 0.0004 3e-04 0.4479
## x8 0.0003 -0.0003 0.0010 3e-04 0.3051
## x9 0.0005 -0.0002 0.0011 3e-04 0.1663
## x10 -0.0006 -0.0012 0.0001 3e-04 0.0783
## x11 -0.0001 -0.0008 0.0005 3e-04 0.6765
## x12 -0.0002 -0.0008 0.0005 3e-04 0.5746
## x13 0.0003 -0.0003 0.0010 3e-04 0.3349
## x14 -0.0003 -0.0009 0.0004 3e-04 0.4280
## x15 -0.0004 -0.0010 0.0003 3e-04 0.2681
## x16 0.0001 -0.0005 0.0008 3e-04 0.6734
## x17 0.0002 -0.0005 0.0009 3e-04 0.5366
## x18 0.0001 -0.0006 0.0007 3e-04 0.8653
## x19 0.0000 -0.0006 0.0007 3e-04 0.9181
## x20 -0.0009 -0.0016 -0.0003 3e-04 0.0041
## x21 -0.0001 -0.0008 0.0005 3e-04 0.7436
## x22 0.0005 -0.0002 0.0012 3e-04 0.1237
## x23 -0.0002 -0.0009 0.0004 3e-04 0.5021
## x24 0.0005 -0.0002 0.0011 3e-04 0.1361
## x25 -0.0001 -0.0008 0.0005 3e-04 0.7085
## x26 0.0000 -0.0007 0.0006 3e-04 0.8829
## x27 -0.0006 -0.0012 0.0001 3e-04 0.0930
## x28 0.0001 -0.0006 0.0007 3e-04 0.8194
## x29 -0.0007 -0.0013 0.0000 3e-04 0.0431
## x30 -0.0002 -0.0009 0.0005 3e-04 0.5462
## x31 0.0008 0.0001 0.0014 3e-04 0.0158
## x32 -0.0002 -0.0008 0.0005 3e-04 0.5935
## x33 0.0001 -0.0006 0.0008 3e-04 0.7676
## x34 -0.0001 -0.0008 0.0005 3e-04 0.6665
## x35 0.0005 -0.0001 0.0012 3e-04 0.1212
## x36 -0.0003 -0.0009 0.0004 3e-04 0.3821
## x37 -0.0002 -0.0009 0.0005 3e-04 0.5353
## x38 -0.0002 -0.0008 0.0005 3e-04 0.6204
## x39 0.0003 -0.0004 0.0010 3e-04 0.3606
## x40 0.0000 -0.0007 0.0007 3e-04 0.9917
## x41 0.0002 -0.0004 0.0009 3e-04 0.5045
## x42 0.0002 -0.0005 0.0008 3e-04 0.6014
## x43 0.0001 -0.0006 0.0007 3e-04 0.8068
## x44 0.0008 0.0001 0.0014 3e-04 0.0173
8.3.5 A divide-and-conquer algorithm
For a very large data file, we can read the data into R in chunks, analyze each chunk, and then combine the results.
For the bigdata1.txt file, we first inspect the first few rows and obtain the total number of rows in the file.
readLines('D:/data/bigdata1.txt', n=3)
## [1] "0.078 0.758 -0.694 -0.947 -2.581 -0.149 -0.2 1.964 0.326 0.992 0.881 1.62 2.207 0.372 0.38 1.351 1.971 0.474 0.03 0.959 -0.675 -0.024 -0.102 -1.448 -1.158 0.01 -1.184 -0.572 0.52 -0.608 -0.547 0.598 -0.567 1.471 -1.067 -0.452 0.916 0.379 -0.943 -1.19 -0.086 -1.576 -0.6 1.452 0.189"
## [2] "2.698 2.123 0.131 -0.918 1.051 -0.701 -0.302 1.344 0.328 -0.49 -0.837 -0.658 0.95 0.123 1.191 -0.272 0.086 1.179 0.707 -0.891 -1.135 1.082 -0.332 -0.474 -0.974 -1.049 0.685 -0.203 -1.67 -0.783 -1.205 -1.23 0.5 -0.522 -0.007 1.652 0.104 -0.489 -0.823 -0.013 0.324 0.257 1.557 -0.645 0.619"
## [3] "-0.504 0.357 -1.48 -0.426 -1.107 1.407 -0.847 1.979 0.23 0.673 -0.601 0.242 1.543 0.206 0.241 -0.095 0.432 -0.243 0.265 2.195 2.175 -0.394 1.12 -0.834 2.391 2.097 -1.66 1.927 -0.349 -0.882 -0.1 0.382 -1.084 0.136 0.331 0.413 0.103 -0.609 0.964 1.03 -0.007 0.879 -0.585 -0.194 0.485"
# On Linux/Mac: system('wc -l D:/data/bigdata1.txt', intern = TRUE)
# On Windows:
system('find /v /c "" D:/data/bigdata1.txt')
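Alternatively, the rows can be counted portably within R itself by reading the file in chunks, so the whole file never has to fit in memory (a sketch):
con <- file("D:/data/bigdata1.txt", open = "r")
n <- 0
while (length(chunk <- readLines(con, n = 100000)) > 0) {
  n <- n + length(chunk)   # count 100,000 lines at a time
}
close(con)
n   # total number of rows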
Now, suppose we would like to fit a regression model to the data. Instead of using all the data at once, we read in 10,000 rows at a time, so the analysis is repeated 930 times. Because every chunk has the same size and the rows are independent, simply averaging the 930 sets of coefficient estimates gives a sensible combined estimate.
## rows to skip before each chunk of 10,000
first.row <- seq(0, length=930, by=10000)
all.coef <- list()
for (i in 1:10){ ## change 10 to 930
bg.subset <- read.table("D:/data/bigdata1.txt",
skip = first.row[i],
nrows= 10000,
header = FALSE)
names(bg.subset) <- c('y', paste0('x', 1:44))
model.subset <- lm(y~., data=bg.subset)
all.coef[[i]] <- coef(model.subset)
cat(i, "\n")
}
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
all.coef.matrix <- do.call(rbind, all.coef)
colMeans(all.coef.matrix)
## (Intercept) x1 x2 x3 x4
## 1.538501e-04 7.971463e-01 4.997889e-01 2.368758e-03 -2.127635e-03
## x5 x6 x7 x8 x9
## -4.442199e-03 1.566934e-03 1.922532e-04 2.433141e-04 1.315510e-03
## x10 x11 x12 x13 x14
## 2.349351e-04 -2.323344e-05 -1.482999e-03 3.138025e-03 3.163150e-04
## x15 x16 x17 x18 x19
## -3.359310e-03 2.223121e-03 3.108206e-03 -2.669120e-03 -3.142849e-03
## x20 x21 x22 x23 x24
## -2.291401e-03 3.418715e-03 -4.013795e-03 -3.270496e-03 2.510819e-03
## x25 x26 x27 x28 x29
## -2.935929e-03 -9.711691e-06 -3.827271e-04 2.305276e-04 2.536062e-03
## x30 x31 x32 x33 x34
## 2.261201e-03 -4.328551e-03 6.042383e-04 1.137028e-03 1.985576e-03
## x35 x36 x37 x38 x39
## 2.730040e-03 -2.217446e-03 -2.185202e-03 -4.545015e-03 2.930815e-03
## x40 x41 x42 x43 x44
## 1.609477e-03 -1.350481e-03 1.365137e-03 -1.974796e-03 7.837193e-04
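Because the chunks are independent and of equal size, the chunk-to-chunk variability also yields an approximate standard error for each averaged coefficient (a sketch):
# Standard error of the mean of the chunk-level estimates
apply(all.coef.matrix, 2, function(b) sd(b) / sqrt(length(b)))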