Preface
This series is aimed at providing tools for an electrical engineer to gain confidence in the performance and reliability of their design. The focus is on applying statistical analysis to empirical results (i.e. measurements, data sets).
Introduction
This article introduces linear regression on a data set using the R Project software. It is useful when your data falls "on a line" (one quantity depends on another) rather than clustering around a single value as in a Gaussian distribution.
If you are not familiar with statistics or need a brush up I recommend Schaum's Statistics. It provides a good overview of the material, with lots of examples and without a lot of time spent on proofs.
Concepts
Estimator Theory: a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. An estimator attempts to approximate the unknown parameters using the measurements.
Estimator: a method for calculating an estimate of a given quantity based on observed data. Many examples can be found in the Wikipedia article on estimators.
Linear Regression: a method for estimating a linear function from measured data using a least squares estimator. Quite simply, it means taking a data set and deriving a linear function (y=mx+b) from it. Once you have the function you can calculate how close the data is to it (the error, or even the distribution of the errors).
The equation for linear regression is of the form y=m*x+b, where b is the y-intercept (or offset) and m is the slope of the line.
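As a rough illustration of what the least squares estimator computes, the slope and intercept can be worked out by hand and compared with R's built-in lm(). The data here is made up for the example (it is not the DAC data set) and the variable names are my own.

```r
# Made-up example data: a line y = 2x + 1 with a little noise added.
set.seed(1)                      # make the random noise repeatable
x <- 1:20
y <- 2 * x + 1 + rnorm(20, sd = 0.1)

# Closed-form least squares estimates for y = m*x + b
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - m * mean(x)

# The same fit using lm(); it reports the intercept first, then the slope
fit <- lm(y ~ x)
print(c(b = b, m = m))
print(coefficients(fit))
```

Both print statements should show the same two numbers, close to the b=1 and m=2 used to generate the data.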
Importing Your Data Set
I will use the R software package for statistical analysis. It is cross-platform, free and open source. There are also several good Excel plugins, and if you have access to SAS, by all means use it.
The first row of your data set should be the titles for each column. Each column can contain anything, but for a linear regression we need one column for the input value and at least one column of measurements, with a row for each input value.
NOTE: The following assumes we are testing an implementation of a 5V 8-bit DAC on a PCB. The data set contains two samples. In a real product I would probably want many samples to build a distribution.
column1 (dac.level): DAC input value
column2 (perf): perfect DAC output
column3 (sample1): sample1
column4 (sample2): sample2
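If you do not have a CSV file like this to hand, a data frame with the same four columns can be built directly in R and written out. This is only a sketch: the offsets, slopes and noise are invented for illustration and are not the numbers behind my data set. It also shows read.csv() with an explicit path, which avoids the interactive file.choose() dialog.

```r
# Sketch only: offsets, slopes and noise below are invented for illustration.
set.seed(42)                 # repeatable random noise
level <- 1:256               # same 1..256 numbering as the data set in this article
lsb   <- 5 / 2^8             # ideal step of a 5V 8-bit DAC, in volts

dac_sim <- data.frame(
  dac.level = level,
  perf      = level * lsb,
  sample1   = 0.15 + level * lsb + runif(256, -0.05, 0.05),
  sample2   = -0.45 + level * 1.2 * lsb + runif(256, -0.05, 0.05)
)

# Round trip through a temporary CSV file instead of using file.choose()
csv_path <- tempfile(fileext = ".csv")
write.csv(dac_sim, csv_path, row.names = FALSE)
dac_in <- read.csv(csv_path)

dim(dac_in)
names(dac_in)
```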
> dac_out<-read.csv(file.choose())
> dim(dac_out)
[1] 256   4
> tail(dac_out)
    dac.level     perf  sample1  sample2
251       251 4.902344 5.042472 5.463848
252       252 4.921875 5.099334 5.492536
253       253 4.941406 5.052925 5.479490
254       254 4.960938 5.118694 5.505745
255       255 4.980469 5.082373 5.524927
256       256 5.000000 5.124489 5.510183
> str(dac_out)
'data.frame':   256 obs. of  4 variables:
 $ dac.level: int  1 2 3 4 5 6 7 8 9 10 ...
 $ perf     : num  0.0195 0.0391 0.0586 0.0781 0.0977 ...
 $ sample1  : num  0.164 0.146 0.175 0.221 0.271 ...
 $ sample2  : num  -0.437 -0.438 -0.346 -0.326 -0.369 ...
> attach(dac_out)
Finding the Linear Regression Equation Coefficients
First we perform the linear regression. Let's use sample1 as our data set.
> dac_out.lm=lm(sample1 ~ dac.level, data=dac_out)
> coeffs = coefficients(dac_out.lm); coeffs
(Intercept)   dac.level
  0.1478863   0.0195291
We can see from the coefficients that there is a DC offset of about 0.15V and a slope of 0.0195V per DAC level. A plot of DAC input versus output for a perfect 5V, 8-bit DAC also has a slope of 0.0195 (one bit is 5*(1/2^8)=0.0195V), so the slope error is nearly zero. The DC offset is considerable: about 7.6 DAC bits. Maybe the DAC analog rail is high by 0.15V.
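The comparison against a perfect DAC is simple arithmetic, sketched here with the coefficients copied from the output above. The variable names are my own.

```r
# Fitted coefficients copied from the lm() output above
intercept <- 0.1478863   # volts
slope     <- 0.0195291   # volts per DAC level

lsb <- 5 / 2^8           # ideal step size of a 5V 8-bit DAC: 0.01953125 V

offset_lsb    <- intercept / lsb          # offset expressed in ideal steps
slope_err_pct <- (slope / lsb - 1) * 100  # slope error in percent of ideal

round(offset_lsb, 2)
round(slope_err_pct, 3)
```

This works out to an offset of roughly 7.6 steps and a slope error of about -0.01%.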
Coefficient of Determination
We can also check how well our function fits the data set by calculating the coefficient of determination. The value is between 0 and 1:
> summary(dac_out.lm)$r.squared
[1] 0.9996259
This function fits the data exceptionally well. These numbers were cooked; I would never expect to see a fit this good in a real data set.
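For reference, the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on simulated data (the numbers are made up, not my DAC measurements) shows that the hand calculation matches what summary() reports:

```r
# Simulated stand-in for a DAC sample; the values are invented for illustration.
set.seed(7)
x <- 1:256
y <- 0.15 + 0.0195 * x + runif(256, -0.05, 0.05)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # variation left over after the fit
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean
r2_manual <- 1 - ss_res / ss_tot

r2_manual
summary(fit)$r.squared
```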
Calculating Output Values
I find that I never need to know the values of a Gaussian distribution equation (only area under the curve), but frequently do with linear data sets. We can use our function to estimate the DAC output based on DAC input value:
DAC Level 0:
> coeffs[1]+coeffs[2]*0
(Intercept)
  0.1478863
DAC Level 127:
> coeffs[1]+coeffs[2]*127
(Intercept)
   2.628082
DAC Level 256:
> coeffs[1]+coeffs[2]*256
(Intercept)
   5.147337
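The same estimates can be had from predict(), which takes several input values at once. The sketch below uses simulated data in place of my data set, so the numbers will differ from those above. Note that the "(Intercept)" label printed with the hand calculation is just a name carried over from coeffs[1]; unname() removes it.

```r
# Simulated stand-in for sample1; the values are invented for illustration.
set.seed(3)
sim <- data.frame(dac.level = 1:256)
sim$sample1 <- 0.15 + 0.0195 * sim$dac.level + runif(256, -0.05, 0.05)
fit <- lm(sample1 ~ dac.level, data = sim)

# Let predict() evaluate the fitted line at several DAC levels at once
new_levels <- data.frame(dac.level = c(0, 127, 256))
by_predict <- predict(fit, new_levels)

# The same calculation by hand, with the misleading name stripped
cf <- coefficients(fit)
by_hand <- unname(cf[1] + cf[2] * new_levels$dac.level)

by_predict
by_hand
```

Keep in mind that level 0 lies just outside the 1 to 256 range of the data, so that value is an extrapolation.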
Significance Test
If we print the entire summary we can see the p-values for this data set and check how well the output tracks the input. Hypothesis testing and p-values are covered in a later article.
> summary(dac_out.lm)
Call:
lm(formula = sample1 ~ dac.level, data = dac_out)
Residuals:
      Min        1Q    Median        3Q       Max
-0.047289 -0.025417 -0.000074  0.024516  0.052090 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1478862  0.0035137   42.09   <2e-16 ***
dac.level   0.0195291  0.0000237  823.89   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02803 on 254 degrees of freedom
Multiple R-squared:  0.9996,    Adjusted R-squared:  0.9996
F-statistic: 6.788e+05 on 1 and 254 DF,  p-value: < 2.2e-16
As expected the p-value is very low, meaning the relationship between DAC level and output is statistically significant.
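If you want the p-values as numbers rather than reading them off the printout, they can be pulled from the coefficient table of the summary. A small sketch, again on simulated data rather than my data set:

```r
# Simulated stand-in for sample1; the values are invented for illustration.
set.seed(11)
sim <- data.frame(dac.level = 1:256)
sim$sample1 <- 0.15 + 0.0195 * sim$dac.level + runif(256, -0.05, 0.05)
fit <- lm(sample1 ~ dac.level, data = sim)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]

p_values
p_values["dac.level"] < 0.05   # TRUE when the slope is significant at the 5% level
```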
Confidence Interval
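It is probably more useful to know the confidence interval for a given value. In our case, for a given DAC value, how tightly does the fit pin down the expected (mean) output? Note that interval="confidence" bounds the fitted mean; the spread of individual readings is wider and is described by interval="prediction".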
> newdata=data.frame(dac.level=128)
> predict(dac_out.lm, newdata, interval="confidence")
       fit      lwr      upr
1 2.647611 2.644162 2.651061
> predict(dac_out.lm, newdata, interval="confidence", level=0.95)
       fit      lwr      upr
1 2.647611 2.644162 2.651061
> predict(dac_out.lm, newdata, interval="confidence", level=0.99)
       fit      lwr      upr
1 2.647611 2.643065 2.652158
> predict(dac_out.lm, newdata, interval="confidence", level=0.999)
       fit      lwr      upr
1 2.647611 2.641779 2.653443
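If a single bit is about 0.02V, then at the 0.99 confidence level the mean output at this DAC level is pinned down to within about half a bit (2.652158-2.643065=0.009093V). Keep in mind that this interval describes the fitted line rather than individual readings: the residual standard error of 0.028V reported in the summary is larger than one bit, so single readings scatter more than this interval suggests. We also cannot ignore the offset, which will produce an error in our output.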
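To see how much a single reading can be expected to vary, predict() also accepts interval="prediction". The sketch below compares the two interval types on simulated data (made-up numbers, not my data set); the prediction interval should come out considerably wider.

```r
# Simulated stand-in for sample1; the values are invented for illustration.
set.seed(5)
sim <- data.frame(dac.level = 1:256)
sim$sample1 <- 0.15 + 0.0195 * sim$dac.level + runif(256, -0.05, 0.05)
fit <- lm(sample1 ~ dac.level, data = sim)

newdata <- data.frame(dac.level = 128)
conf_int <- predict(fit, newdata, interval = "confidence", level = 0.99)
pred_int <- predict(fit, newdata, interval = "prediction", level = 0.99)

# Interval widths in volts: fitted mean versus a single new reading
conf_width <- unname(conf_int[1, "upr"] - conf_int[1, "lwr"])
pred_width <- unname(pred_int[1, "upr"] - pred_int[1, "lwr"])
c(confidence = conf_width, prediction = pred_width)
```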
Sample 2
Let's repeat the steps for our second sample:
> dac_out.lm2=lm(sample2 ~ dac.level, data=dac_out)
> coeffs2 = coefficients(dac_out.lm2); coeffs2
(Intercept)   dac.level
-0.44788333  0.02344204
> summary(dac_out.lm2)
Call:
lm(formula = sample2 ~ dac.level, data = dac_out)
Residuals:
      Min        1Q    Median        3Q       Max
-0.053173 -0.023877  0.000947  0.026726  0.047597 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.479e-01  3.713e-03  -120.6   <2e-16 ***
dac.level    2.344e-02  2.505e-05   936.0   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02961 on 254 degrees of freedom
Multiple R-squared:  0.9997,    Adjusted R-squared:  0.9997
F-statistic: 8.76e+05 on 1 and 254 DF,  p-value: < 2.2e-16
> newdata=data.frame(dac.level=128)
> predict(dac_out.lm2, newdata, interval="confidence", level=0.999)
       fit      lwr     upr
1 2.552698 2.546536 2.55886
This sample has some serious problems. There is a massive negative offset of about -0.45V (roughly 23 bits), so the fitted output does not rise above 0V until around DAC level 19. The slope of 0.0234 cannot be ignored either. Compared to a perfect slope of ~0.0195 it is about 20% too steep, which adds roughly 1V of error across the DAC range.
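The size of these errors can be worked out from the sample 2 coefficients. This is a sketch using the values copied from the output above; the variable names are my own.

```r
# Fitted coefficients for sample2, copied from the lm() output above
intercept2 <- -0.44788333   # volts
slope2     <-  0.02344204   # volts per DAC level

lsb <- 5 / 2^8              # ideal step size of a 5V 8-bit DAC, in volts

offset_lsb   <- intercept2 / lsb           # offset in ideal steps
gain_err_pct <- (slope2 / lsb - 1) * 100   # slope error in percent of ideal
zero_cross   <- -intercept2 / slope2       # level where the fitted line crosses 0V

# Deviation of the fitted line from the ideal line at both ends of the data
level <- c(1, 256)
err   <- (intercept2 + slope2 * level) - lsb * level

round(c(offset_lsb = offset_lsb, gain_err_pct = gain_err_pct, zero_cross = zero_cross), 2)
round(err, 3)
```

The fitted line starts about 0.44V below the ideal output at level 1 and ends about 0.55V above it at level 256.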
Next Up
The next article will show how hypothesis testing can provide insight into your debug efforts by questioning how much effect a circuit modification really has on your design.