Wednesday, March 20, 2013

Violin plots and regional income distribution

While preparing my slides for statistical graphics, a plot really caught my eye when I was playing around with the data.

I started off by plotting the time seriesof GNI per capita by country, and as expected it got quite messy and incomprehensible.

## Download and manipulate the data
library(FAOSTAT)
raw.lst = getWDItoSYB(indicator = c("NY.GNP.PCAP.CD", "SP.POP.TOTL"))
raw.df = raw.lst[["entity"]]
traw.df = translateCountryCode(raw.df, from = "ISO2_WB_CODE", to = "UN_CODE")
mraw.df = merge(traw.df, FAOregionProfile[, c("UN_CODE", "UNSD_MACRO_REG")])
final.df = mraw.df[!is.na(mraw.df$UNSD_MACRO_REG), ]

## Simple ugly time series plot
ggplot(data = final.df, aes(x = Year, y = NY.GNP.PCAP.CD)) +
    geom_line(aes(col = Country)) +
    labs(x = NULL, y = "GNI per capita")

plot of chunk unnamed-chunk-1

So I decided to compute the weighted average by region to examine the regional trends.

## Compute regional aggregates based on UN M49 definition
reg.df = aggRegion(aggVar = "NY.GNP.PCAP.CD", weightVar = "SP.POP.TOTL",
    data = traw.df, keepUnspecified = FALSE, aggMethod = "weighted.mean",
    relationDF = data.frame(UN_CODE = FAOregionProfile[, "UN_CODE"],
                            REG_NAME = FAOregionProfile[, "UNSD_MACRO_REG"]))

## Plot regional aggregates
ggplot(data = reg.df[!is.na(reg.df$NY.GNP.PCAP.CD), ],
    aes(x = Year, y = NY.GNP.PCAP.CD)) +
    geom_line(aes(col = REG_NAME)) +
    labs(x = NULL, y = "GNI per capita", col = "")

plot of chunk unnamed-chunk-2

I can now see the trend clearly, but there are two problems with this approach. First, the variability within region is vast and thus the weighted average or any summary statistic such as quantile can be misleading and it does not tell me what is going on within the regions. Secondly, since a minimum of 65% of the country must be present in order to compute the aggregation, no statistics was available prior to 1985.

While I was carrying out regional comparisons with box-plot and violin plot I thought why not plot them accross time as well! So here is the final graph:

## Time series violin plot
ggplot(data = final.df,
       aes(x = as.character(Year), y = NY.GNP.PCAP.CD)) +
    geom_violin() + scale_y_log10() +
        facet_wrap(~UNSD_MACRO_REG, ncol = 1, scales = "free_y") +
    scale_x_discrete(breaks = as.character((seq(1960, 2010, by = 10))),
                     labels = as.character((seq(1960, 2010, by = 10)))) +
    labs(x = NULL, y = "GNI per capita")

plot of chunk unnamed-chunk-3

Now I can compare the regions, but at the same time I can see the within region income distribution. It amazes me how the income distribution diverges in Europe and Oceania while America and Asia moves towards a bell shaped distribution. Growth in Africa appears to be slow, but there are several countries which are growing at a faster rate and pushing the tail of the distribution. Although some of the variability in the density may have resulted from independence of countries, nonetheless it is still infromative.

No comments:

Post a Comment