Monday, October 8, 2012

Cooking with Statistics

I received my copy of "La Cucina - The regional cooking of Italy" from Accademia Italiana Della Cucina today and I can't wait to dig into all 2000 Italian regional recipes in the book!

I have always believed that everything has a mathematical form, and that these forms are everywhere in our lives.

Take cooking, for example: a recipe is a model, ingredients are like data, and the chef turns the ingredients into magnificent dishes just as a statistician turns data into useful insight.

Furthermore, vegetables, meat and all other inputs, just like data, are not always the same. Time-series data are subject to sampling frequency, and surveys are prone to non-response; likewise, ingredients can be good or bad, and they follow a distribution of their own.

A bad cook like me would blindly follow the recipe and make a terrible dish. A good chef, on the other hand, would analyse the ingredients by looking at, smelling and even tasting them to determine their state, just as a good analyst would plot the data to get a feel for it.

The only difference is that a statistician analyses all this information with a computer, while the chef analyses it with his own personal computer: his brain and senses.

I wonder whether one day, when we can measure every aspect of taste, a terrible cook like me could turn into a master chef!

Wednesday, October 3, 2012

Peculiar behaviour of the sum function

The sum function in R is a special one in contrast to other summary statistics functions such as mean and median. The first distinction is that sum is a primitive function while the others are not (although mean does call .Internal). This causes several inconsistencies and unexpected behaviours.

(1) Inconsistency in arguments

For example, the arguments are inconsistent. Both mean and median take the argument x, while sum has the signature sum(..., na.rm = FALSE) and operates on every argument that is not matched to na.rm. This is a problem when you want to write a function that switches between summary functions, such as:

do.call(myFUN, list(x = x))

where myFUN can be any statistical summary function. The problem first arose when I wanted to write a function encompassing several summary statistics so that I could switch between them when required. The main difficulty is passing additional arguments, such as the weights w in the weighted.mean function. I wrote the following call and naively hoped it would work:

do.call(myFUN, list(x = x, w = w))

It turns out that this line of code works fine for all the summary statistics except sum, where the weights are swallowed by the ... argument and summed along with x. So my current solution is just to use the switch function, which is not my favourite.
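A minimal sketch of the failure mode, along with the switch-based workaround (the function name applyStat is made up for illustration):

```r
x <- c(1, 2, 3)
w <- c(1, 1, 2)

# mean() and weighted.mean() match arguments by name:
do.call(mean, list(x = x))                  # 2
do.call(weighted.mean, list(x = x, w = w))  # 2.25

# sum() has the signature sum(..., na.rm = FALSE): anything not named
# na.rm falls into "...", so the weights get summed along with x.
do.call(sum, list(x = x, w = w))            # 10, i.e. sum(x) + sum(w)

# Dispatching explicitly with switch() avoids the problem:
applyStat <- function(statName, x, w = NULL) {
  switch(statName,
         sum           = sum(x),
         mean          = mean(x),
         median        = median(x),
         weighted.mean = weighted.mean(x, w = w))
}
applyStat("sum", x, w)  # 6 -- weights correctly ignored
```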

(2) Inconsistency in output

Another inconsistency arises in how NAs are treated. With mean, median and weighted.mean, if all the observations are NA then either NA or NaN is returned:

mean(rep(NA, 10), na.rm = TRUE)
median(rep(NA, 10), na.rm = TRUE)

The sum function, however, returns zero. It puzzles me how you get zero when NA stands for "not available"; it is like creating something out of nothing. This is a problem for me: when I sum up multiple time series with missing values, I want the function to remove NAs and compute the sum where there is partial data, but return NA instead of zero when there is no data at all.
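To see the three behaviours side by side:

```r
x <- rep(NA_real_, 10)

mean(x, na.rm = TRUE)    # NaN -- 0/0 after removing everything
median(x, na.rm = TRUE)  # NA
sum(x, na.rm = TRUE)     # 0  -- the sum of an empty vector

sum(c(1, NA, 3), na.rm = TRUE)  # 4 -- partial data works as expected
```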

Nevertheless, a simple solution exists, thanks to the active R community. This post on R-help addresses the problem and solves it elegantly:

sum(x, na.rm = any(!is.na(x)))
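Wrapped into a small helper (sumNA is my own name for it), this gives exactly the time-series behaviour I described:

```r
# Sum that drops NAs when there is at least some data,
# but returns NA (not 0) when every observation is missing.
sumNA <- function(x) sum(x, na.rm = any(!is.na(x)))

sumNA(c(1, NA, 3))       # 4  -- partial data: NAs removed
sumNA(rep(NA_real_, 5))  # NA -- no data at all
```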


"The computations and the software for data analysis should be trustworthy" - John Chamber, Software for Data Analysis

I am not sure about the reasoning underlying the behaviour of sum, but it should be consistent so that people can trust it and use it as they expect.