go back

#######################################################################
####### Build a 3D, 2 var kernel density estimation with R from scratch
#######################################################################

1. Introduction goto
2. A simple histogram function goto
3. A simple kernel density estimation function goto
4. A multi-dimension kernel density estimation function goto
5. Using montecarlo to estimate the optimal bandwidth goto


1. Introduction gotop

Let's have the following variable

set.seed(1)
x=rnorm(30)

To represent the distribution of x, I can use the in-built function to obtain an histogram
and a kernel density estimation function.

par(mfrow=c(1,2))
hist(x)
plot(density(x),t="l")


I will try to build here these functions from scratch and add some features.

2. A simple histogram function gotop

I build a function with, in input, I enter a vector and in output I display an histogram of this
vector. In input, I can also choose the number of bars of the histogram.

f_histo=function(x,bar=7){
	a=seq(min(x),max(x),l=bar+1)
	o=sapply(1:bar,function(i){sum(if(i==bar){x>=a[i]}else{x>=a[i]&x<a[i+1]})})
	plot(NA,xli=c(min(a),max(a)),yli=c(0,max(o)),t="n",yla="Frequency",
		xla=substitute(x),m=paste("Histogram of",substitute(x)))
	v=sapply(1:bar,function(i){rect(a[i],0,a[i+1],o[i])})
}
f_histo(x)

First I separate the values of x in a finite number of same size ranges. Let's say 7.

bar=7
a=seq(min(x),max(x),l=bar+1)

Then I count the number of x values in each range. The sapply function allows me to use the vector
a I have just created and to count the number of dots in each range. Thanks to this function I
don't need to use a loop or a recursive function.

o=sapply(1:bar,function(i){sum(if(i==bar){x>=a[i]}else{x>=a[i]&x<a[i+1]})})

Now I just have to build a plot with different bar. The width a each bar will be the ranges in the
vector a and the height of each bar will be the numbers in the vector o. First I display an empty
plot. Then I use the function rect to draw the different bars in this plot.

plot(NA,xli=c(min(a),max(a)),yli=c(0,max(o)),t="n",yla="Frequency",
	xla=substitute(x),m=paste("Histogram of",substitute(x)))
sapply(1:bar,function(i){rect(a[i],0,a[i+1],o[i])})


3. A simple kernel density estimation function gotop

A kernel density estimation (KDE) is a kind of continuous version of the histogram. For a specific
value, for instance 0, the value of the KDE will increase when many values of x are near 0. To do
that, for each value of x, you apply the normal density function with mean 0 and a specific student
variation. And then you take the mean of all of these values and you obtain the KDE estimation for
0. For instance, if you take a student variation of 1, the KDE of 0 will be:

mean(dnorm(x,0,1))

Now we calculate the KDE estimation for 1 000 different values and we can draw a nice KDE graph:

f_density=function(x,bw=0.5){
	a=seq(min(x)-3*bw,max(x)+3*bw,l=1000)
	o=sapply(a,function(v){mean(dnorm(x,v,bw))})
	plot(a,o,t="l",xla=NA,yla="Density",m=paste("Density of",substitute(x)),
		s=paste("N =",length(x),", bandwidth =",bw))
}
f_density(x)


Another way to see KDE is that we add a white noise to our vector x. The vector x become smoother
and almost continous. For each of the 30 values of the vector x, I generate 10 000 random values
following a normal law with a student variation of 0.5 and a mean equals to the value of the x dot.
Then draw an histogram with 1 000 bars using our histogram function. We obtain again our KDE graph.

v=rnorm(length(x)*10000,x,0.5)
f_histo(v,bar = 1000)


4. A multi-dimension KDE function gotop

I improve a bit the function f_density. The input will be a dataframe with one or more column. The
output will be a dataframe with one column for every variables with a range of representative
values. There will be a last column with the KDE for these representative values. There will be
two others input variables: the bandwidth or the degree of smoothness of the KDE and the number of
reprensentative values to display in the output.

f_density2=function(x,bw=0.5,nsize=10000){
	n=dim(x)[2]
	for(i in 1:n){
		o1=seq(min(x[,i])-3*bw,max(x[,i])+3*bw,l=floor(nsize^(1/n)))
		if(i==1){o=as.data.frame(o1)}else{o=merge(o,o1)}
		names(o)[i]=paste("x",i,sep="")
	}
	f=function(i){
		if(i>=1){f(i-1)+sapply(1:dim(o)[1],function(v){mean(dnorm(x[,i],o[v,i],bw))})/n}
		else{0}
	}
	o=cbind(o,d=f(n))
}

First, I build a dataframe "o" with a range of representative values of the different variables of
x. For every variable I take a range of representative values. The number of representative values,
floor(nsize^(1/n)), is calculated so the final number of representative values in the dataframe o
will be equal to nsize or a bit less.

o1=seq(min(x[,i])-3*bw,max(x[,i])+3*bw,l=floor(nsize^(1/n)))

Then I do a cartesian product of all these vectors to obtain the dataframe o with my range of
representative values. As I do cartesian product, the size of o can increase very quickly. It's why
I restrain it to nsize.

n=dim(x)[2]
for(i in 1:n){
	o1=seq(min(x[,i])-3*bw,max(x[,i])+3*bw,l=floor(nsize^(1/n)))
	if(i==1){o=as.data.frame(o1)}else{o=merge(o,o1)}
	names(o)[i]=paste("x",i,sep="")
}

Then I calculate the KDE for every columns and I do the mean to have the final KDE. I use a
recursive function to do that. I add these column to the dataframe o which will be the output of my
function.

f=function(i){
	if(i>=1){f(i-1)+sapply(1:dim(o)[1],function(v){mean(dnorm(x[,i],o[v,i],bw))})/n}
	else{0}
}
o=cbind(o,d=f(n))

I can use this function on a data frame x which contains 2 variables x1 and x2. I use then plot3D to
represent the density function on a graph.

set.seed(1)
x1=rnorm(10)
x2=2*x1+rnorm(10)
x=cbind(x1,x2)

y=f_density2(x,bw = 1)
plot3D::points3D(x = y[,1],y = y[,2],z=y[,3],zlab="Density",xlab="x1",ylab="x2",
	main="Histogram of x",sub="N = 10, bandwidth = 1")

y=f_density2(x,bw = 0.1)
plot3D::points3D(x = y[,1],y = y[,2],z=y[,3],zlab="Density",xlab="x1",ylab="x2",
	main="Histogram of x",sub="N = 10, bandwidth = 0.1")




Game over! gotop