This guide is meant to provide a quick start for using qFeature
package. We begin with installation instructions in ‘Getting Started’. Following installation is a basic diagram of the ‘Package Structure’ to illustrate the hierarchy of functions. Finally, we conclude with details regarding the four ‘Core Functions’ contained within the package complete with follow-along examples.
The qFeature
package is designed to extract features from continuous and discrete variables contained within a time series dataset. These features can then be used as inputs to multivariate statistical procedures like clustering, dimensionality reduction, and classification. This is a high-speed implementation of the feature extraction methods of the Morning Report Algorithms developed by Brett Amidan and Tom Ferryman (Amidan and Ferryman 2005).
The first thing you will need to do is install the qFeature
package. Installation instructions are provided in the README.md file of the package repository. Once the package has been installed, the library can be loaded so that the functions can be used.
library(qFeature)
A complete list of the functions contained within the qFeature
package can be seen using:
help(package = qFeature)
In what follows, we explain the four core functions of qFeature
: ddply_getFeatures()
, getFeatures()
, discFeatures()
, and fitQ
.
This function is intended for use on a time series variable with discrete states and calculates the percentage of time spent at each of those states as well as counting the transitions in the variable from one state to another.
Using discFeatures()
is most easily demonstrated with a simple example. Let’s begin by creating a small dataset with 2 discrete states (TRUE/FALSE).
discData <- c("TRUE", "FALSE", "FALSE", NA, "TRUE", "TRUE", NA, NA, "TRUE", "FALSE", "TRUE", "FALSE", "TRUE")
discData
## [1] "TRUE" "FALSE" "FALSE" NA "TRUE" "TRUE" NA NA
## [9] "TRUE" "FALSE" "TRUE" "FALSE" "TRUE"
Now we have a small data set of length 13 that contains two discrete phases (also called “grouping variables”) and 3 missing values stored as NA. Now if we apply the discFeatures()
function to our dataset we can see what happens.
discFeatures(discData)
## percent.FALSE percent.TRUE num_trans.FALSE_TRUE
## 0.4 0.6 3.0
## num_trans.TRUE_FALSE
## 3.0
As you can see the percentage calculations are made without consideration of the missing values and reflect that 40% of the data are FALSE and 60% of the data are TRUE. Additionally we have some information about transitions or in other words we can see that the value changed from FALSE to TRUE 3 times and from TRUE to FALSE 3 times. Go ahead and count the transitions yourself just to make sure you agree with the output. The transitions have significance because this function is intended to be used on time series data and so a transition from one state to another could be meaningful.
Fits a moving quadratic (or simple linear) regression model over a series using a 2-sided window. It dynamically accounts for the incomplete windows which are caused by missing values and which occur at the beginning and end of the series. This function is used to extract data from continuous variables.
For your vector of response measures, a window is fit to the data that is initially centered at whatever point is indexed by start
. From here the x1
is used to define the width of the window which is extended equally in both directions about a center point (which is why the length must be odd). min.window
defines how many points are required in order to fit a model and subsequently produce a value in our signature. Once our window characteristics are defined we can choose one of two regression models:
Linear: \(y=b_0+b_1 x_1+\epsilon\)
or
Quadratic: \(y=b_0+b_1 x_1+b_2 x_2^2+\epsilon\)
After our initial linear or quadratic model is fit, several features are extracted from the regression and denoted as a, b, c, and d which are defined below.
a: The estimated intercepts
b: The estimated linear coefficients
c: The estimated quadratic coefficients. These are NA if linear.only = TRUE
d: The root mean squared error (RMSE)
The data moves through the window by an increment of skip
which has a default of 1 and fits the regression model using the data contained in each new window. This iterates over the entire vector of the response measure in order to produce a signature of features with a corresponding a, b, c, and d for each regression that was fit.
The illustration below provides a simple demonstration of how fitQ()
fits a series of regression models to finite windows of data. The gray points denote data that falls outside the current window that is being used in fitting the regression model. In this example a window of x1
= -3:3 is used with a min.window
= 4. Unless otherwise specified the first window will begin centered on the first data point and likewise the last window will be centered on the last data point.
Now let’s take a look at a few quick R example that illustrate the function at work as well as how the function reacts to several potential issues in the data.
We begin by creating our first sample data set to help understand where the features are coming from.
set.seed(10)
fitqDataEx1 <- rnorm(7, 5, 1)
fitqDataEx1
## [1] 5.018746 4.815747 3.628669 4.400832 5.294545 5.389794 3.791924
We now have a small vector of length 7 containing randomly generated numbers. Now we can pass the vector into our fitQ()
function and take a look at the output.
fitQ(y = fitqDataEx1, x1 = -3:3, min.window = 4)
## $a
## [1] 5.165912 4.321421 4.232649 4.635257 4.795443 5.028309 3.836657
##
## $b
## [1] -1.03545313 -0.57956943 -0.05914872 -0.03094635 -0.05480118 -0.65967154
## [7] -2.04183511
##
## $c
## [1] 0.243790387 0.296618854 0.175257194 -0.003805034 -0.085028875
## [6] -0.395609306 -0.622895830
##
## $d
## [1] 0.6581464 0.4737011 0.5094792 0.8563527 0.9110471 0.4057781 0.2000512
##
## attr(,"class")
## [1] "fitQ" "list"
It may not be readily apparent, but there are the same number of values for each of the 4 features extracted and the number of values should equal the length of the vector y
(when skip
=1) because the center of the first window is on the first point and the center moves by an increment of skip
which has a default of 1 through the entire vector of y
.
Furthermore, the first values of a, b, c, and d are all drawn from the same window as are the second, third, and so on and so forth. We can illustrate this by manually fitting the 2nd window and comparing the results. Realize at the second window fit there will only be 5 points that fall inside the window and are fit to the regression, which can be seen in the illustration above. For this reason we will fit the regression to the first 5 points from our dataset and the corresponding section of our window.
y <- fitqDataEx1[1:5]
x1 <- c(-1:3)
summary(lm(y ~ x1 + I(x1^2)))
##
## Call:
## lm(formula = y ~ x1 + I(x1^2))
##
## Residuals:
## 1 2 3 4 5
## -0.17886 0.49433 -0.40980 0.05207 0.04226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3214 0.2887 14.969 0.00443 **
## x1 -0.5796 0.2942 -1.970 0.18765
## I(x1^2) 0.2966 0.1266 2.343 0.14387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4737 on 2 degrees of freedom
## Multiple R-squared: 0.7332, Adjusted R-squared: 0.4665
## F-statistic: 2.749 on 2 and 2 DF, p-value: 0.2668
As you can see (aside from rounding):
Hopefully the first example helps to solidify how the feature extraction is being populated. Let’s continue by taking a look at a few issues you might encounter.
First begin by taking a look at the illustration above and ask yourself “When the first window is centered on the first point, what happens if our min.window
= 5 instead?”. Let’s take a look at an example to demonstrate this scenario. We begin by creating another data set.
set.seed(20)
fitqDataEx2 <- rnorm(7, 5, 1)
fitqDataEx2
## [1] 6.162685 4.414076 6.785465 3.667406 4.553433 5.569606 2.110282
We now have a small vector of length 7 containing randomly generated numbers. Now we can pass the vector into our fitQ()
function with a min.window
=5 and see how the function responds.
fitQ(y = fitqDataEx2, x1 = -3:3, min.window = 5)
## $a
## [1] NA 5.528857 4.897691 5.100493 4.913496 3.831878 NA
##
## $b
## [1] NA -0.3650638 -0.3026263 -0.4313635 -0.6333119 -0.8237286
## [7] NA
##
## $c
## [1] NA -0.01572677 0.14075827 -0.08716049 -0.22529714 -0.03945601
## [7] NA
##
## $d
## [1] NA 1.614705 1.386009 1.521011 1.630093 1.897100 NA
##
## attr(,"class")
## [1] "fitQ" "list"
As you can see in the output an NA is produced for the first value of each of the features as well as at the end. This is because we are requiring 5 points to fit a regression and since the first and last windows each contain only 4 points (refer to Window 1 and Window 10 of the illustration if you need to convince yourself), no regression will be fit and there will be no features to extract so the function will return NAs.
So what happens then the function encounters missing values in the data? We begin by creating another data set, this time with missing values.
set.seed(30)
fitqDataEx3 <- rnorm(15, 5, 1)
fitqDataEx3[c(5,7, 9, 10)] <- NA
fitqDataEx3
## [1] 3.711482 4.652311 4.478371 6.273473 NA 3.488692 NA
## [8] 4.239204 NA NA 3.976728 3.180602 4.332210 4.940702
## [15] 5.880166
We now have an example data vector of length 15 with 11 non-missing and 4 missing values. We can now implement the function and see what happens.
fitQ(y = fitqDataEx3, x1 = -3:3, min.window = 4)
## $a
## [1] 3.865672 4.189739 5.337928 5.371472 4.791795 4.395873 NA
## [8] NA 4.391305 3.629989 3.538460 3.696237 4.043787 4.809632
## [15] 5.836461
##
## $b
## [1] 0.11049855 0.53763516 0.32774826 -0.26065985 -0.15740980
## [6] -0.23641293 NA NA -0.03461775 -0.13110603
## [11] 0.09938414 0.13840282 0.55669758 0.97499234 0.71161019
##
## $c
## [1] 0.21356830 0.21356830 -0.29420406 -0.29420406 -0.04757267
## [6] 0.01100045 NA NA -0.11501634 0.09628955
## [11] 0.11647796 0.20914738 0.20914738 0.20914738 -0.05303605
##
## $d
## [1] 0.6895608 0.6895608 0.9035960 0.9035960 1.3103955 1.8098212 NA
## [8] NA 0.1557777 0.7259587 0.5212704 0.4599708 0.4599708 0.4599708
## [15] 0.1954521
##
## attr(,"class")
## [1] "fitQ" "list"
We can see that NA values were produced in each of the features that correspond with the 7th and 8th windows fit to the data. Let’s investigate the data in those windows to see why.
Window7 <- fitqDataEx3[4:10]
Window7
## [1] 6.273473 NA 3.488692 NA 4.239204 NA NA
Window8 <- fitqDataEx3[5:11]
Window8
## [1] NA 3.488692 NA 4.239204 NA NA 3.976728
As you can see, in the range of data points for both Window7
and Window8
we only have 3 non-missing data points which resulted in our inability to fit regressions and consequently produced NA values in our feature extraction.
The getFeatures()
function is the main workhorse of the qFeature()
package. It is called against a data frame, where fitQ()
is applied to the continuous variables and discFeatures()
is applied to the discrete (or categorical) variables.
At this point it is important that you understand what is happening in both the fitQ()
and the discFeatures()
functions, because getFeatures()
is simply an aggregate of the two. Rather than dealing with a single vector at a time, this function is capable of dealing with a data frame consisting of multiple continuous and/or discrete variables. The only additional functionality introduced by the getFeatures()
function is the ability to output desired summary statistics of the features extracted from the regression models.
Let’s take a look at an example of the getFeatures()
function. We begin by creating a data frame that consists of 2 continuous and 2 discrete state variables.
set.seed(10)
cont1 <- rnorm(10,9,1)
cont2 <- runif(10,0,10)
disc1 <- discData <- c("T", "F", "F",
"T", "T", "T",
"F", "T", "F",
"T")
disc2 <- c("blue", "red", "yellow",
"yellow", "blue", "red",
"blue", "red", "yellow",
"blue")
getFeaturesEx <- data.frame(cont1, cont2, disc1, disc2)
getFeaturesEx
## cont1 cont2 disc1 disc2
## 1 9.018746 8.647212 T blue
## 2 8.815747 6.153524 F red
## 3 7.628669 7.751099 F yellow
## 4 8.400832 3.555687 T yellow
## 5 9.294545 4.058500 T blue
## 6 9.389794 7.066469 T red
## 7 7.791924 8.382877 F blue
## 8 8.636324 2.395891 T red
## 9 7.373327 7.707715 F yellow
## 10 8.743522 3.558977 T blue
We learned earlier that qFit()
could be used to handle a continuous vector and discFeatures()
is able to extract information from the discrete state variable, but let’s push the whole data frame into getFeatures()
.
outGetFeatures <- getFeatures(getFeaturesEx, cont = 1:2, disc = 3:4,
stats = c("mean", "sd"), fitQargs = list(x1 = -3:3))
outGetFeatures
## cont1.a.mean cont1.a.sd
## -0.11625763 0.51336981
## cont1.b.mean cont1.b.sd
## -0.09775823 0.37153955
## cont1.c.mean cont1.c.sd
## 0.12170029 0.23103515
## cont1.d.mean cont1.d.sd
## 1.00407293 0.21753467
## cont2.a.mean cont2.a.sd
## -0.05175897 0.35596310
## cont2.b.mean cont2.b.sd
## -0.19217181 0.20769032
## cont2.c.mean cont2.c.sd
## 0.01532974 0.12476572
## cont2.d.mean cont2.d.sd
## 1.05163189 0.28870859
## disc1.percent.F disc1.percent.T
## 0.40000000 0.60000000
## disc1.num_trans.F_T disc1.num_trans.T_F
## 3.00000000 3.00000000
## disc2.percent.blue disc2.percent.red
## 0.40000000 0.30000000
## disc2.percent.yellow disc2.num_trans.blue_red
## 0.30000000 3.00000000
## disc2.num_trans.blue_yellow disc2.num_trans.red_blue
## 0.00000000 1.00000000
## disc2.num_trans.red_yellow disc2.num_trans.yellow_blue
## 2.00000000 2.00000000
## disc2.num_trans.yellow_red
## 0.00000000
It should be apparent that this output is different than what you saw in fitQ()
and that is because once you start looking across multiple variables it no longer makes sense to look at a long string of features. Instead we look at feature summary statistics.
[variable].[feature].[stat]
In our example we include mean and standard deviation for all the values extracted for that feature. There are many more summary statistics that could be included that are listed in the getFeatures()
help page.
The discrete variables are presented almost identically what we saw in the discFeatures()
function. The only real difference here is that now we have a variable identifier to discriminate between discrete variables. Other than that we still have a summary of the percentage of time spent at each state as well as a count of each type of transition present in the data.
[variable].[frequency].[from]_[to]
The ddply_getFeatures()
function is simply a wrapper that allows the getFeatures()
function to be implemented for each “group” in a data frame, where the groups are defined by the unique combinations
of the values of one or more categorical variables. The replication of getFeatures()
for each group is carried out by ddply()
from the plyr
package.
The importance of this wrapper is that it facilitates processing unique subsets in the data and allows for parallel processing.
For this example let’s use a dataset built into the qFeature
package.
data(demoData)
str(demoData)
## 'data.frame': 628 obs. of 11 variables:
## $ subject: Factor w/ 7 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ phase : Factor w/ 3 levels "e","f","g": 1 1 1 1 1 1 1 1 1 1 ...
## $ contA : num 670 790 501 657 629 ...
## $ contB : num 0 41.6 65.9 83.2 96.6 ...
## $ contC : num 181.2 245.4 51.4 93.7 18.6 ...
## $ contD : num 5 8 72 80 200 216 343 384 648 900 ...
## $ contE : num 3.74 12.19 13.53 46.97 40.58 ...
## $ discA : int 4 1 2 4 3 3 4 3 4 4 ...
## $ discB : logi TRUE FALSE FALSE FALSE TRUE TRUE ...
## $ discC : Factor w/ 3 levels "j","k","l": 3 2 1 3 3 1 3 3 2 3 ...
## $ discD : chr "x" "y" "y" "z" ...
Now that we have a data set let’s pretend we are interested in every combination of subject and phase. At this point it is important to note that in order to work properly, the package expects the data it is working with to be structured in several ways. First, it is essential that any data set being processed by this package contain a field which indicating to which grouping variable each record is associated. Second, the package assumes that your data is presented in chronological order (from oldest to newest) throughout each grouping variable. Once you’ve ensured that your data is structured, you can then proceed to process it through the ddply_getFeatures
function. The output will be very similar to getFeatures
however in this case instead of getting one value for each summary, we will end up with a set of values for each summary equal to the number of unique combinations. A quick calculation shows us that:
\((\mbox{number of subjects}) \cdot (\mbox{number of phases}) = \mbox{number of combinations}\)
meaning that we will have:
\((7) \cdot (3) = 21\) values for each output variable.
We can verify this by pushing our data into the ddply_getFeatures()
function.
f <- ddply_getFeatures(demoData, c("subject", "phase"),
cont = 3:4, disc = 8, stats = c("mean", "sd"),
fitQargs = list(x1 = -5:5), nJobs = 2)
str(f)
## 'data.frame': 21 obs. of 34 variables:
## $ subject : Factor w/ 7 levels "1","2","3","4",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ phase : Factor w/ 3 levels "e","f","g": 1 2 3 1 2 3 1 2 3 1 ...
## $ contA.a.mean : num 0.9594 -0.0963 0.6678 -0.2903 -0.5715 ...
## $ contA.a.sd : num 0.405 0.412 0.457 0.381 0.316 ...
## $ contA.b.mean : num -0.05352 0.00514 0.07705 -0.09571 -0.02266 ...
## $ contA.b.sd : num 0.1473 0.1887 0.2799 0.2819 0.0589 ...
## $ contA.c.mean : num 0.00119 0.01584 -0.00711 0.00979 0.00112 ...
## $ contA.c.sd : num 0.037 0.0359 0.0695 0.0573 0.0237 ...
## $ contA.d.mean : num 1.077 0.777 0.977 0.73 0.785 ...
## $ contA.d.sd : num 0.18 0.159 0.469 0.208 0.207 ...
## $ contB.a.mean : num 0.883 -0.235 0.557 -0.138 -0.214 ...
## $ contB.a.sd : num 1.124 0.809 1.171 0.92 0.87 ...
## $ contB.b.mean : num 0.03379561516091365847 0.037014245176244277291 0.000000000000000000601 -0.002676778537393192763 -0.00589473771636578759 ...
## $ contB.b.sd : num 0.311 0.281 0.365 0.304 0.273 ...
## $ contB.c.mean : num -0.0193 -0.0212 -0.0248 -0.022 -0.0187 ...
## $ contB.c.sd : num 0.02 0.0152 0.024 0.016 0.0165 ...
## $ contB.d.mean : num 0.0443 0.0495 0.0602 0.0476 0.0429 ...
## $ contB.d.sd : num 0.0409 0.0288 0.0483 0.0247 0.0309 ...
## $ discA.percent.1 : num 0.25 0.25 0.27 0.258 0.257 ...
## $ discA.percent.2 : num 0.25 0.25 0.243 0.258 0.257 ...
## $ discA.percent.3 : num 0.25 0.25 0.243 0.258 0.257 ...
## $ discA.percent.4 : num 0.25 0.25 0.243 0.226 0.229 ...
## $ discA.num_trans.1_2: num 5 1 2 4 1 3 3 1 3 1 ...
## $ discA.num_trans.1_3: num 1 3 2 3 2 3 4 1 2 2 ...
## $ discA.num_trans.1_4: num 1 2 2 0 0 3 0 2 2 2 ...
## $ discA.num_trans.2_1: num 3 1 3 2 2 3 2 1 4 1 ...
## $ discA.num_trans.2_3: num 2 3 1 2 3 2 1 2 1 1 ...
## $ discA.num_trans.2_4: num 1 1 3 1 3 2 3 2 1 3 ...
## $ discA.num_trans.3_1: num 2 2 1 1 2 3 3 2 1 2 ...
## $ discA.num_trans.3_2: num 0 2 1 2 2 1 0 1 2 2 ...
## $ discA.num_trans.3_4: num 5 2 4 3 3 3 2 2 1 0 ...
## $ discA.num_trans.4_1: num 3 3 1 3 0 2 1 2 1 2 ...
## $ discA.num_trans.4_2: num 1 2 4 0 4 4 3 2 1 1 ...
## $ discA.num_trans.4_3: num 4 0 4 1 2 2 1 2 2 1 ...
If you need a reminder what all the variables being displayed are coming from, you can go ahead and read through the getFeatures()
section above. We can clearly see that there are 21 values for each variable that correspond to the 21 combinations of subject and phase. Here is a quick look at the data.
head(f)
## subject phase contA.a.mean contA.a.sd contA.b.mean contA.b.sd
## 1 1 e 0.95935578 0.4049560 -0.053521124 0.14729155
## 2 1 f -0.09634974 0.4122154 0.005139956 0.18871465
## 3 1 g 0.66780935 0.4570175 0.077047009 0.27991844
## 4 2 e -0.29030386 0.3812480 -0.095714751 0.28194231
## 5 2 f -0.57154431 0.3163018 -0.022664253 0.05891394
## 6 2 g -0.33432391 0.2662339 -0.037518578 0.10858981
## contA.c.mean contA.c.sd contA.d.mean contA.d.sd contB.a.mean contB.a.sd
## 1 0.0011891119 0.03701984 1.0773529 0.18018835 0.8828291 1.1243437
## 2 0.0158372006 0.03588238 0.7773663 0.15862451 -0.2349595 0.8088642
## 3 -0.0071132548 0.06948251 0.9774247 0.46886780 0.5571112 1.1712935
## 4 0.0097902953 0.05731774 0.7298378 0.20754398 -0.1379738 0.9195388
## 5 0.0011172144 0.02373771 0.7848196 0.20694008 -0.2138106 0.8699450
## 6 -0.0001564956 0.03137132 0.6432965 0.08199484 0.1598097 0.9701766
## contB.b.mean contB.b.sd contB.c.mean contB.c.sd
## 1 0.0337956151609136584701432 0.3107336 -0.01931491 0.02003349
## 2 0.0370142451762442772911221 0.2806856 -0.02115442 0.01515542
## 3 0.0000000000000000006007066 0.3646303 -0.02475077 0.02401877
## 4 -0.0026767785373931927625502 0.3038650 -0.02200054 0.01597915
## 5 -0.0058947377163657875900893 0.2725885 -0.01874485 0.01654354
## 6 -0.0008357828409793670657035 0.2744961 -0.01737797 0.01770307
## contB.d.mean contB.d.sd discA.percent.1 discA.percent.2 discA.percent.3
## 1 0.04433838 0.04089909 0.2500000 0.2500000 0.2500000
## 2 0.04945053 0.02884678 0.2500000 0.2500000 0.2500000
## 3 0.06022609 0.04830885 0.2702703 0.2432432 0.2432432
## 4 0.04759370 0.02473617 0.2580645 0.2580645 0.2580645
## 5 0.04290807 0.03086693 0.2571429 0.2571429 0.2571429
## 6 0.03897107 0.03278651 0.2682927 0.2439024 0.2439024
## discA.percent.4 discA.num_trans.1_2 discA.num_trans.1_3
## 1 0.2500000 5 1
## 2 0.2500000 1 3
## 3 0.2432432 2 2
## 4 0.2258065 4 3
## 5 0.2285714 1 2
## 6 0.2439024 3 3
## discA.num_trans.1_4 discA.num_trans.2_1 discA.num_trans.2_3
## 1 1 3 2
## 2 2 1 3
## 3 2 3 1
## 4 0 2 2
## 5 0 2 3
## 6 3 3 2
## discA.num_trans.2_4 discA.num_trans.3_1 discA.num_trans.3_2
## 1 1 2 0
## 2 1 2 2
## 3 3 1 1
## 4 1 1 2
## 5 3 2 2
## 6 2 3 1
## discA.num_trans.3_4 discA.num_trans.4_1 discA.num_trans.4_2
## 1 5 3 1
## 2 2 3 2
## 3 4 1 4
## 4 3 3 0
## 5 3 0 4
## 6 3 2 4
## discA.num_trans.4_3
## 1 4
## 2 0
## 3 4
## 4 1
## 5 2
## 6 2
Hopefully now you have the knowledge required to implement any of the core functions successfully in your own data set. As you use the package, bear in mind that output from qFeature
functions are features, or summary statistics, that contain information about the behavior of the time series. These features will likely need further graphical and statistical analysis, typically using multivariate statistical techniques like dimensionality reduction, clustering, and/or classification.
Amidan BG, Ferryman TA. 2005. “Atypical Event and Typical Pattern Detection within Complex Systems.” IEEE Aerospace Conference Proceedings, March 2005.
A mathematical description of the algorithms in fitQ
and discFeatures
is available here.
Alternatively, after installing the qFeature
package, the description is available locally, and this R command will show you where it is located:
file.path(path.package("qFeature"), "doc", "Explanation_of_qFeature_algorithms.pdf")