Title: | Empirical CDF for Heavy Tailed Data |
---|---|
Description: | Computes and plots a transformed empirical CDF (ecdf) as a diagnostic for heavy tailed data, specifically data with power law decay on the tails. Routines for annotating the plot, comparing data to a model, fitting a nonparametric model, and some multivariate extensions are given. |
Authors: | John P Nolan <[email protected]> |
Maintainer: | John P Nolan <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.1 |
Built: | 2025-02-28 04:16:32 UTC |
Source: | https://github.com/cran/ecdfHT |
The ecdfHT
package computes and plot a transformed empirical cdf for data.
This is useful because a standard empirical cdf (ecdf) gives little information
about the tails of the data when there are extreme values.
The transform is nonparametric: linear in the middle of the data and matched to a log-log transform on the tails, where the tail regions are determined by quantiles. If the data has power law behavior on the tails, the plot is linear on those tails, so this plot can be used as a graphical diagnostic to determine if a data set is heavy tailed.
In addition, there are functions to
annotate the plot, add custom axes and grid lines
overlay proposed models on the plot
fit the tails using linear regression on the transformed tails
combine the empirical cdf in the middle and the above fit on the tails to get a semi-parametric fit to the data
compute cdf, pdf, quantiles, and simulate from the semi-parametric fit
some multivariate plots that look at tail behavior of multiple components and some idea of the dependence.
I will try to fix the code if you provide a simple demonstration of a bug. Polite suggestions for improvements will be considered if there is time available.
ecdfHT
for a basic plot,
ecdfHT.draw
for annotations and additions to a basic plot,
ecdfHT.fit
to fit a semi-parametric model to the data,
pecdfHT
to compute the cdf, pdf, quantiles and simulate from a
semi-parametric model,
ecdfHT.multivar
for multivariate generalizations.
Produces a basic plot showing a transformed empirical cdf for heavy tailed data. It uses a log-log transform on the tails, which shows power law decay as linear.
ecdfHT(x, scale.q = c(0.25, 0.5, 0.75), show.axes.labels = TRUE, show.plot = TRUE, type = "p", ...)
ecdfHT(x, scale.q = c(0.25, 0.5, 0.75), show.axes.labels = TRUE, show.plot = TRUE, type = "p", ...)
x |
A vector of data |
scale.q |
A vector of 3 probabilites; specifies the quantiles of the data to use for the left tail, mid region, and right tail |
show.axes.labels |
Boolean value: indicates whether default labels are plotted or not. (Use function |
show.plot |
Boolean value: show plot or only do calculations |
type |
Type of plot, passed to plot. Use type='p' for points, type='l' for lines |
... |
Optional graphical parameters, e.g. col='red' |
Most of the work is done by ecdfHT.draw
and the associated helper functions.
Assuming no repeats in x, ecdf = (standard ecdf - (1/2))/n, like type=5 in the R function quantile
.
So instead of taking values 1/n, 2/n, 3/n, ... , k/n, ..., 1 it takes values
1/(2n), 3/(2n), ..., (2k-1)/(2n), ..., (2n-1)/(2n).
This avoids 0 at lower endpoint and 1 at upper endpoint, which causes problems
when we extend tails with a power law. (If there are m repeated x values, then the corresponding
jump in the ecdf at that point is m/n instead of 1/n.)
The default values scale.q=c(.25,.5,.75) splits the data into quartiles; picking different quantiles splits the data into 4 different groups: the lowest group is the left tail, i.e. all values less than the quantile corresponding to scale.q[1]; the next group is between the left tail and center = quantile scale.q[2]); the third group is the center and quantile scale.q[3]; the last group is the upper tail. For two-sided data, it makes sense to use something like (p,0.5,1-p) for scale.q, where p is choosen to determine where the tail regions begin.
For one-sided data, it makes sense to use scale.q=c(0,0,p). In this case, the first two groups are empty and the effect is to divide the data into two groups: a moderate/lower range and a right tail. See the example below with nonnegative data.
The transformations h(x) acts on these different regions. It is linear on the middle two regions and logarithmic on the tails. The transformation g(p) acts on the corresponding values of the ecdf described above. The basic plot shows (h(x[i]),g(ecdf[i])): the first component is a monotonic transform of the x values, the second component is a monotonic transform of the ecdf. See the accompanying vignette for exact definitions: go to the package index and click on User guides, package vignettes and other documentation.
An object of class 'ecdfHT.transform' which gives the information necessary to draw the plot and later add other curves and labels. This list is returned invisibly and contains the following fields:
vector of length 3, copied from the input argument
vector of length 3, the quantiles from the data corresponding to scale.q
vector of the sorted, unique data values
nonstandard empirical cdf, see details
transformed x values: xx[i]=h(xsort[i])
transformed ecdf values: yy[i]=g(ecdf[i])
ecdfHT.draw
for annotations and additions to a basic plot
x <- rcauchy( 1000 ) ecdfHT( x ) title("basic ecdfHT plot") xabs <- abs(x) ecdfHT( xabs, scale.q=c(0,0,.75) ) title("one sided data")
x <- rcauchy( 1000 ) ecdfHT( x ) title("basic ecdfHT plot") xabs <- abs(x) ecdfHT( xabs, scale.q=c(0,0,.75) ) title("one sided data")
Does the computations and plotting for ecdfHT
and can be used to add to an existing plot.
ecdfHT.draw(transform.info, x, p, show.plot = TRUE, new.plot = FALSE, show.ci = FALSE, xlab = "x", ylab = "", ...) ecdfHT.axes(transform.info, x.labels = c(), y.labels = c(), show.vert.gridlines = FALSE, show.horiz.gridlines = FALSE, ...) ecdfHT.h(x, t) ecdfHT.g(p, q)
ecdfHT.draw(transform.info, x, p, show.plot = TRUE, new.plot = FALSE, show.ci = FALSE, xlab = "x", ylab = "", ...) ecdfHT.axes(transform.info, x.labels = c(), y.labels = c(), show.vert.gridlines = FALSE, show.horiz.gridlines = FALSE, ...) ecdfHT.h(x, t) ecdfHT.g(p, q)
transform.info |
A list with information about the transformation, computed in |
x |
The data, a vector of double precision numbers. Assumbed to be sorted and have distinct values. |
p |
Probabilities, a vector of doubles. Typically p[i]=(i=0.5)/length(x), unless there are repeats in x. |
show.plot |
Boolean value: indicates whether to plot or not. |
new.plot |
Boolean value: indicates whether to produce a new plot or add to an existing plot. |
show.ci |
Boolean value: indicates whether or not confidence intervals are shown. |
xlab |
String to label the horizontal axis. |
ylab |
String to label the vertical axis. |
... |
Optional parameters for the plot, e.g. col='red'. |
x.labels |
Vector of numbers specifying the location of the labels on the horizontal axis |
y.labels |
Vector of numbers specifying the location of the labels on the vertical axis |
show.vert.gridlines |
Boolean value indicating whether or not vertical grid lines should be drawn. |
show.horiz.gridlines |
Boolean value indicating whether or not horizontal grid lines should be drawn. |
t |
A vector of length 3 that specifies the x values that determine the left tail, middle, and right tail |
q |
A vector of length 3 that specifies the quantile values that determine the left tail, middle, and right tail. |
ecdfHT.draw
computes transform and plots.
ecdfHT.axes
draws axes on the plot; it can be used to manually select tick marks, etc.
ecdfHT.h
computes the function h(x) for the transformation of the horizontal axis.
ecdfHT.g
computes the function g(p) for the transformation of the vertical axis.
Always call ecdfHT
first to produce the basic plot, then use ecdfHT.draw
to add other curves to the plot as in the examples below
A list of values used in the plot, see return value of ecdfHT
.
ecdfHT.h
returns the vector y=h(x;t), ecdfHT.g
returns the vector y=g(p;q)
set.seed(1) x <- rcauchy( 1000 ) t.info <- ecdfHT( x, show.axes=FALSE ) ecdfHT.axes( t.info, x.labels=c(-50,-5,0,5,50), y.labels=c(.001,.01,.1,.5,.9,.99,.999), show.vert.gridlines=TRUE, show.horiz.gridline=TRUE, lty=2 ) q1 <- qcauchy(t.info$ecdf) # Cauchy quantiles ecdfHT.draw( t.info, q1, t.info$ecdf, col='red',show.ci=TRUE) q2 <- qnorm(t.info$ecdf,sd=sd(x)) # Gaussian quantiles ecdfHT.draw( t.info, q2, t.info$ecdf, col='green',show.ci=TRUE) title(paste("simulated Cauchy data, n=",length(x),"\nred=Cauchy cdf, green=normal cdf")) x <- seq(-5,5,1) t <- c(-3,0,3) ecdfHT.h(x,t) p <- seq(0.05,.95,.1) q <- c(.1,.5,.9) ecdfHT.g(p,q)
set.seed(1) x <- rcauchy( 1000 ) t.info <- ecdfHT( x, show.axes=FALSE ) ecdfHT.axes( t.info, x.labels=c(-50,-5,0,5,50), y.labels=c(.001,.01,.1,.5,.9,.99,.999), show.vert.gridlines=TRUE, show.horiz.gridline=TRUE, lty=2 ) q1 <- qcauchy(t.info$ecdf) # Cauchy quantiles ecdfHT.draw( t.info, q1, t.info$ecdf, col='red',show.ci=TRUE) q2 <- qnorm(t.info$ecdf,sd=sd(x)) # Gaussian quantiles ecdfHT.draw( t.info, q2, t.info$ecdf, col='green',show.ci=TRUE) title(paste("simulated Cauchy data, n=",length(x),"\nred=Cauchy cdf, green=normal cdf")) x <- seq(-5,5,1) t <- c(-3,0,3) ecdfHT.h(x,t) p <- seq(0.05,.95,.1) q <- c(.1,.5,.9) ecdfHT.g(p,q)
Compute an interpolation of the transformed cdf in the middle with parametric power law decay on the tails.
ecdfHT.fit(p, transform.info, x.min = NA, x.max = NA, add.to.plot = TRUE, weights = "var", ...) ecdfHT.fit.tails(p, transform.info, weights, add.to.plot = TRUE, ...)
ecdfHT.fit(p, transform.info, x.min = NA, x.max = NA, add.to.plot = TRUE, weights = "var", ...) ecdfHT.fit.tails(p, transform.info, weights, add.to.plot = TRUE, ...)
p |
Vector of 2 probabilities that identify the quantile where data is cut to fit power decay on lower/upper tail. Set tail.p[1]=0 to exclude lower tail fit; tail.p[2]=1 to exclude upper tail fit. |
transform.info |
List containing transformation information, returned from |
x.min |
Number describing cut-off of lower tail |
x.max |
Number describing cut-off of upper tail |
add.to.plot |
Boolean indicating whether or not the interpolation is plotted |
weights |
'none' to do unweighted regression or 'var' to use weighted regression on tail with weights proportional to variance of quantile |
... |
Optional parameters passed to plot routines, e.g. col='red' |
An object of class 'ecdfHT.fit' specifying the interpolation. The fields in the value are:
vector of length 3, copied from the input argument
vector of length 3, the quantiles from the data corresponding to scale.q
vector of the sorted, unique data values
nonstandard empirical cdf, see details
transformed x values: xx[i]=h(xsort[i])
transformed p values: yy[i]=g(ecdf[i])
monotonic spline function used to compute the cdf
monotonic spline function used to compute the inverse of the cdf
vector of length 2; probabilities saying where the lower and upper tails begin. Note these are generally not the exact values of input variable p, rather they are the closest values to those found in ecdf
vector of length 2; x values where the lower and upper tails begin
vector of length 2; tail constants for lower and upper powerlaw fit
vector of length 2; slope of tails on transformed plot
vector of length 2; exponents for lower and upper power law fit
integer vector of length 2; indices in xsort where tails begin
copy of input variable weights
x <- rcauchy( 1000 ) a <- ecdfHT( x ) fit <- ecdfHT.fit( c(.1,.9), a, col='red' ) str(fit)
x <- rcauchy( 1000 ) a <- ecdfHT( x ) fit <- ecdfHT.fit( c(.1,.9), a, col='red' ) str(fit)
Transform multivariate data and plot using the ideas from the univariate plot.
ecdfHT.multivar(x, scale.q = matrix(c(0.25, 0.5, 0.75), nrow = 3, ncol = ncol(x)), q0 = 0.5, radii.upper.tail.p = 0.9, p.norm = 2, show.axes.labels = FALSE, zscale = c(500, 1), ...) ecdfHT.multivar.transform(x, scale.q, q0, p.norm) ecdfHT.2d(multivar.obj, zscale = c(500, 1), ...) ecdfHT.2d.axes(zscale) lp.norm(x, p.norm)
ecdfHT.multivar(x, scale.q = matrix(c(0.25, 0.5, 0.75), nrow = 3, ncol = ncol(x)), q0 = 0.5, radii.upper.tail.p = 0.9, p.norm = 2, show.axes.labels = FALSE, zscale = c(500, 1), ...) ecdfHT.multivar.transform(x, scale.q, q0, p.norm) ecdfHT.2d(multivar.obj, zscale = c(500, 1), ...) ecdfHT.2d.axes(zscale) lp.norm(x, p.norm)
x |
Matrix of data of size (n x d) |
scale.q |
matrix of sixe (3 x d), probabilities used to determine the scaling and centering for each component |
q0 |
quantile of radii transformation |
radii.upper.tail.p |
probability used as cutuoff to tail fit; set to 1 to suppress upper tail fit |
p.norm |
Power used in computing L^p norm |
show.axes.labels |
Boolean value, determines if axes are labeled or not |
zscale |
Vector of length 2, value of aspect ratio for the z axis when d=2 and the two 3d plots are drawn |
... |
Optional graphical parameters, e.g. col='red' |
multivar.obj |
An object of class 'ecdfHT.multivar', see details. |
ecdfHT.multivar
gives a quick graphical look at a d dimensional data set.
It produces two plots: the first is a superposition of the univariate ecdfHT
plots
for each component; the second plot is an array of plots, showing one plot
for each component.
ecdfHT.bivar
produces two plots of a bivariate data set.
The first one has three subplots:
a scatter plot of the data, a transformed scatterplot of the data, and a univariate
ecdfHT
plot of the radii of the data.
For the second and third subplot, ???? then g(y[,1]) is plotted against g(y[,2]) to get the second plot.
For the third plot, compute radius r[i]= l_p norm of shifted and scaled data. These radii are
plotted in a univariate, one-sided ecdfHT
plot.
The second plot produced is a 3-dimensonal plot. It takes the first two subplots just described and adds a
third dimension by looking at an ecdf for the radii. Thus the height of a point is low if the point is
near the center, and increases as points move away. The first subplot shows points
(x[i,1],x[i,2],ecdf of r[i]). The second plot transforms all three components:
it shows (h1(x[i,1]),hs(x[i,2]),g(ecdf of r[i]), where h1(.) and h2(.) are scaled versions of
h(.) from the univariate ecdfHT
plot, and g(.) is as in the univariate plot.
See the vignette for more detail.
ecdfHT.multivar
draws several plots, returns a list (invisibly) with fields:
input (n x d) matrix of data
(n x d) matrix of centered and shifted version of x
(n x d) matrix of transformed x
what p-norm to use; p.norm=2 is Euclidean norm
copy of input argument
vector of length n, p-norm of the rows of x.prime
copy of input value
q0-th quantile of the radii
list of length d, with j-th entry the object returned by ecdfHT
for the j-th column of x
list returned from ecdfHT( radii, ... )
object returned from ecdfHT.fit for the radii )
rgl id of 3d plot(s); can be used to access, change, print 3d plots
if d=2, this vector gives the empirical cdf of the radii
if d=2, this vector gives the transformed empirical cdf of the radii
ecdfHT.multivar.transform
computes the transformed vectors y, radii, and
lp.norm
computes the lp-norm of the rows of x
# independent components set.seed(2) x <- matrix( rcauchy(4000), ncol=4 ) ecdfHT.multivar( x ) # radially symmetric r <- rcauchy(1000) theta <- runif(1000,min=0,max=2*pi) x <- cbind( r*cos(theta), r*sin(theta) ) ecdfHT.multivar( x )
# independent components set.seed(2) x <- matrix( rcauchy(4000), ncol=4 ) ecdfHT.multivar( x ) # radially symmetric r <- rcauchy(1000) theta <- runif(1000,min=0,max=2*pi) x <- cbind( r*cos(theta), r*sin(theta) ) ecdfHT.multivar( x )
Use the semi-parametric fit calculated by ecdfHT.fit
to evaluate the cdf F(x), pdf f(x), quantiles and simulate.
pecdfHT(x, ecdfHT.fit) decdfHT(x, ecdfHT.fit) qecdfHT(p, ecdfHT.fit) recdfHT(n, ecdfHT.fit)
pecdfHT(x, ecdfHT.fit) decdfHT(x, ecdfHT.fit) qecdfHT(p, ecdfHT.fit) recdfHT(n, ecdfHT.fit)
x |
A vector of numbers |
ecdfHT.fit |
An object returned by |
p |
Vector of probabilites |
n |
Number of values to simulate |
pecdfHT
computes the cdf of the semi-parametric fit to the data.
decdfHT
computes the pdf of the semi-parametric fit to the data. This is likely very irregular and not of much value except on the tails, where the pdf calculation is computed analytically.
qecdfHT
computes quantiles.
recdfHT
simulates from a semi-parameteric distribution.
pecdfHT
computes the cdf, decdfHT
computes the pdf, qecdfHT
computes the quantiles (inverse of the cdf), recdfHT
simulates from the distribution.
x <- rcauchy(1000) a <- ecdfHT( x, show.plot=FALSE ) fit <- ecdfHT.fit( c(.1,.9), a, add.to.plot=FALSE ) pecdfHT( -3:3, fit ) decdfHT( -3:3, fit ) qecdfHT( seq(.1,.9,.1), fit ) recdfHT( 10, fit )
x <- rcauchy(1000) a <- ecdfHT( x, show.plot=FALSE ) fit <- ecdfHT.fit( c(.1,.9), a, add.to.plot=FALSE ) pecdfHT( -3:3, fit ) decdfHT( -3:3, fit ) qecdfHT( seq(.1,.9,.1), fit ) recdfHT( 10, fit )