
Some examples

A brief introduction

Suppose the linear regression model is

y = Xβ + ε

where y is the vector of the dependent variable, X is the n × p design matrix, ε is an i.i.d. error term with zero mean, n is the number of observations, and p is the number of regression parameters.

When the model contains a single regressor, it can simply be written as

y = β₀ + β₁ x + ε

where β₀ and β₁ are the unknown intercept and slope parameters. In R and Julia, this model can be represented in a similar form. Specifically, in Julia, the simple model can be expressed using the @formula macro as

@formula(y ~ x)

where the ~ operator separates the dependent and independent variables. When the model includes more than one regressor, it can similarly be expressed as

@formula(y ~ x1 + x2 + x3)

LinRegOutliers follows this convention for expressing linear models.


Sebert & Montgomery & Rollier (1998) Algorithm

The Sebert, Montgomery & Rollier (smr98) algorithm starts with an ordinary least squares estimation for a given model and data. Residuals and fitted responses are calculated using the estimated model. A hierarchical clustering analysis is then applied to the standardized residuals and standardized fitted responses. The tree structure of the clusters is cut using a threshold, e.g. the Mojena criterion, as suggested by the authors. Subtrees containing relatively few observations are expected to be declared clusters of outliers.
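
The following is a minimal sketch of this idea, not the package's internal implementation: it assumes the DataFrames, GLM, Statistics, Distances, and Clustering packages and, for simplicity, cuts the tree into two groups instead of applying the Mojena criterion. The function name smr98sketch and the single-linkage choice are illustrative assumptions.

using DataFrames, GLM, Statistics, Distances, Clustering

# Sketch of the smr98 idea: cluster the standardized residuals and
# fitted responses, then suspect the smaller cluster. Illustration only.
function smr98sketch(formula, df::DataFrame)
    ols = lm(formula, df)                     # stage 1: ordinary least squares
    r, f = residuals(ols), predict(ols)
    z = hcat((r .- mean(r)) ./ std(r),        # standardized residuals
             (f .- mean(f)) ./ std(f))        # standardized fitted responses
    d = pairwise(Euclidean(), z, dims = 1)    # distances between observations
    tree = hclust(d, linkage = :single)       # stage 2: hierarchical clustering
    labels = cutree(tree, k = 2)              # cut the tree into two clusters
    small = argmin(map(k -> count(==(k), labels), 1:2))
    findall(==(small), labels)                # smaller cluster: suspected outliers
end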

The Hawkins & Bradu & Kass (hbk) dataset has 4 variables and 75 observations. Observations 1-14 are known to be outliers. In the example below, we create a regression setting using the formula y ~ x1 + x2 + x3 and the hbk dataset; smr98 is then applied directly to this setting.

julia> using LinRegOutliers
julia> # Regression setting for Hawkins & Bradu & Kass data
julia> reg = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk)
julia> smr98(reg)
Dict{String,Array{Int64,1}} with 1 entry:
  "outliers" => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

The Julia method smr98() returns a Dict object holding the produced output. In this case, the single entry is the indices of the detected outlying observations. The algorithm successfully detects the outliers.
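
Since the result is a plain Dict, the detected indices can be extracted by key and reused, for example to list the remaining clean observations:

julia> result = smr98(reg);
julia> cleanindices = setdiff(1:75, result["outliers"])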


Peña and Yohai (1995)

The Peña and Yohai (py95) algorithm starts by constructing an influence matrix using the results of an ordinary least squares estimate for a given model and data. In the second stage, the eigenstructure of the influence matrix is examined to detect subsets of potential outliers.

Here is an example of the py95 method applied on the hbk data. The method returns a Dict object with the keys outliers and suspected.sets. A typical researcher may focus directly on the outliers indices. The method reports that observations 1-14 are outliers.

julia> py95(reg)
Dict{Any,Any} with 2 entries:
  "outliers"       => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
  "suspected.sets" => Set([[14, 13], [43, 54, 24, 38, 22], Int64[], [58, 66, 32, 28, 65, 36], [62], [73, 2

Least Trimmed Squares Regression

Least Trimmed Squares (LTS) is a robust regression estimator with a high breakdown point. LTS searches for the parameter estimates that minimize the sum of the h smallest squared residuals, where h is a constant larger than n/2.
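
As a worked illustration of this criterion (a sketch, not the package's solver), the LTS objective for a candidate coefficient vector can be computed as:

# Sketch of the LTS objective: for candidate coefficients beta,
# sum only the h smallest squared residuals. Illustration only.
function ltsobjective(X::Matrix, y::Vector, beta::Vector, h::Int)
    r2 = (y .- X * beta) .^ 2    # squared residuals
    sum(sort(r2)[1:h])           # keep the h smallest
end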

The phones dataset is a regression dataset with a single regressor. The independent and dependent variables are year and calls, with 24 observations. Observations 14-21 are said to be outliers.

Since LTS is a robust method, the parameter estimates are the main quantities of interest; the indices of the outliers are provided in the results for diagnostic purposes only.

The method lts also reports the standard deviation of the estimate, the scaled residuals, and the value of the LTS objective function.

julia> reg = createRegressionSetting(@formula(calls ~ year), phones);

julia> lts(reg)
Dict{Any,Any} with 6 entries:
  "betas"            => [-56.5219, 1.16488]
  "S"                => 1.10918
  "hsubset"          => [11, 10, 5, 6, 23, 12, 13, 9, 24, 7, 3, 4, 8]
  "outliers"         => [14, 15, 16, 17, 18, 19, 20, 21]
  "scaled.residuals" => [2.41447, 1.63472, 0.584504, 0.61617, 0.197052, -0.222066, -0.551027, -0.970146, -0.397538, -0.185558    91.0312, 94.4889, 109.667, 123.943, 143.629, 
  "objective"        => 3.43133
using Plots

# Scatter plot of the phones data with the fitted LTS line
x = phones[:, "year"]
y = phones[:, "calls"]
f(x) = -56.5219 + 1.16488x                # LTS estimates reported above
scatter(x, y, label = false, title = "Phone Data")
px = [x[1], x[end]]                        # endpoints of the fitted line
py = map(f, px)
plot!(px, py, label = false, color = :red, width = 2)


Figure 1 - Phone Data and estimated LTS line


Data Images

The method dataimage implements the Data Image algorithm and serves as a visual outlier detection tool for multivariate data only. The algorithm generates a color matrix in which each cell represents a distance between a pair of observations. The call

dataimage(data, distance = :euclidean)

defines the colors using Euclidean distances, whereas

dataimage(data, distance = :mahalanobis)

uses Mahalanobis distances to determine the color values. The default distance metric is the Euclidean distance.

In the example below, the distances between observations are calculated and drawn using the corresponding colors. Since the method is for multivariate data only, just the design matrix is used; in other words, the response vector is omitted.

julia> # Matrix of independent variables of Hawkins & Bradu & Kass data
julia> data = hcat(hbk.x1, hbk.x2, hbk.x3);
julia> dataimage(data)


Figure 2 - Data Image of Design Matrix of hbk Data
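
Conceptually, a data image is a heatmap of the matrix of pairwise distances. A minimal sketch of the same idea, assuming the Distances and Plots packages rather than the package's own rendering, is:

using Distances, Plots

# Sketch of the data image idea: color the n x n matrix of
# pairwise distances between observations. Illustration only.
d = pairwise(Euclidean(), data, dims = 1)
heatmap(d, yflip = true, title = "Pairwise distances")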


Atkinson's Stalactite Plot

Atkinson's stalactite plot provides a visual method for detecting outliers in linear regression. Although it shares the same calling convention with the other methods, the method atkinsonstalactiteplot generates a text-based plot. The method applies a robust regression estimator many times, and residuals above certain thresholds are labelled with + and *. After many iterations, the observations that are labelled frequently are considered suspected or outlying observations.

julia> using LinRegOutliers
julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> atkinsonstalactiteplot(reg)
m           1         2
   123456789012345678901234
 2              ********   
 3              ********   
 4 +            ********   
 5 +            ********   
 6 +            ********   
 7 +            ********   
 8 +            ********   
 9 +            ********+  
10 +            ********+  
11 +            ********   
12              ********   
13 +            ********   
14              ********   
15              ********   
16              ********   
17              ********   
18               *******+  
19               ****** +++
20               ****** +++
21               ****** +++
22               ****** +++
23               +++*** +++
24                  ++* +++
   123456789012345678901234
            1         2

The output above can be considered evidence that observations 14-21 are suspected. Observations 1, 22, 23, and 24 are also labelled with + in some iterations; however, the frequency of labels for these observations is relatively low.


Other algorithms

LinRegOutliers implements more than 20 outlier detection methods for linear regression and covers a large proportion of the classical literature on this subject. The package documentation includes the referenced citations, so researchers can follow the details of each algorithm from these sources.


Other calling conventions

The calling convention

julia> setting = createRegressionSetting(@formula(...), data)
julia> method(setting)

is the preferred way of calling the implemented methods in LinRegOutliers. However, we also multiple-dispatch the methods using the syntax

julia> method(X, y)

where X is the design matrix and y is the response vector. This calling convention may be more suitable for those who call the methods iteratively, e.g. in a simulation study or other kinds of research.
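
For instance, a small simulation that plants outliers in generated data and counts how many lts recovers might look like the sketch below (the generated data, contamination scheme, and function name simulate are illustrative assumptions):

using LinRegOutliers

# Sketch of iterative use in a simulation study. Illustration only.
function simulate(nrep = 100, n = 50)
    hits = 0
    for _ in 1:nrep
        X = hcat(ones(n), randn(n))          # design matrix with intercept
        y = X * [5.0, 2.0] + randn(n)        # clean responses
        y[1:5] .+= 10.0                      # contaminate the first five responses
        result = lts(X, y)
        hits += length(intersect(result["outliers"], 1:5))
    end
    hits / (5 * nrep)                        # share of planted outliers recovered
end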

Returning to the phones data, we can apply hs93 using

julia> hs93(reg)
Dict{Any,Any} with 3 entries:
  "outliers" => [14, 15, 16, 17, 18, 19, 20, 21]
  "t"        => -3.59263
  "d"        => [2.04474, 1.14495, -0.0633255, 0.0632934, -0.354349, -0.766818, -1.06862, -1.47638, -0.710

as well as

julia> X = hcat(ones(24), phones[:, "year"]);

julia> y = phones[:, "calls"];

julia> hs93(X, y)
Dict{Any,Any} with 3 entries:
  "outliers" => [14, 15, 16, 17, 18, 19, 20, 21]
  "t"        => -3.59263
  "d"        => [2.04474, 1.14495, -0.0633255, 0.0632934, -0.354349, -0.766818, -1.06862, -1.47638, -0.710

Multiple Methods in a single shot!

We also provide the detectOutliers method, which applies many of the implemented methods at once and presents summarized results.

The method can be called with default arguments by simply feeding it a regression setting object:

julia> detectOutliers(aSettingObject)

The method generates a summary output in the console.
