The Log Transformation

STATA Module


This material will NOT be covered on the Diagnostic Exam, although you will need to understand how to manipulate logarithms (see Basic Math).


One approach to dealing with violations of the normality assumption behind the t-test is to transform the data.  The goal of the transformation is to mathematically manipulate the data so that it becomes more normally distributed.  There are multiple potential transformations, but I would argue the most prevalent in the environmental and environmental policy worlds is the natural logarithm transformation.  Because the natural log works on a multiplicative scale, it compresses large values far more than small ones: it pulls in large outliers and spreads out the data in the lower end of the distribution, frequently making the new distribution more normally distributed.  If you need to review the properties of logarithms, please go to the Statistics Review Website.
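
To see this compression concretely, here is a small sketch using STATA's display command in a do-file. The three city sizes are made-up round numbers chosen for illustration, not values from the data set.

```stata
* Hypothetical city populations (illustrative values, not from the data set)
display ln(100000)     // about 11.51
display ln(1000000)    // about 13.82
display ln(8000000)    // about 15.89
```

On the raw scale the largest of these cities is 7.9 million people away from the smallest; on the log scale the gap is only about 4.4 units, and every tenfold increase in population adds the same amount (about 2.30).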


In this module, we will use the municipal climate change data that Professor Albright compiled to examine factors that influence local-level climate change adoption.  For the purpose of this module, Professor Albright restricted the data set to the year 2005.  There are 81 cities in this data set (cities in the eastern portion of the US with populations greater than 100,000), 22 of which had signed the US Conference of Mayors Climate Protection Agreement by the end of 2005.  If you would like to know more about this climate agreement, go to the US Conference of Mayors' website.


In this module, we will test whether cities that have signed the agreement have larger populations than cities that have NOT signed the agreement. The population data were collected from the US Census Bureau for the year 2005.




Ho:  Mean Population of Signatory Cities –  Mean Population of Non-Signatory Cities ≤ 0


Ha:  Mean Population of Signatory Cities –  Mean Population of Non-Signatory Cities > 0


We could also write out the hypotheses in terms of Greek symbols. We should use Greek symbols because hypotheses are always about populations.


Ho: μ(pop) signatory – μ(pop) non-signatory ≤ 0

Ha: μ(pop) signatory – μ(pop) non-signatory > 0


Descriptive Statistics


After developing our hypotheses, the first step in any analysis is to examine and summarize the data.  The two variables of interest are Population and Climate (a binary variable: 0=non-signatory city; 1=signatory city).  Below you will find the boxplots of the 2005 populations of non-signatory and signatory cities. To produce the side-by-side boxplots, I used the STATA code:

STATA CODE: graph box population, by(climate)




To calculate the summary statistics, I used the STATA code:

STATA CODE:  by climate, sort : summarize population, detail


[Screenshot: STATA output of the summarize command, by climate]

Looking at the two distributions and summary statistics, some information should jump out at us.  First, the sample sizes are not equal. Second, the samples do not appear to be from normally distributed populations (see the box plots). Third, the standard deviations of the two samples are quite different.  In comparison of means tests, violations of the assumption of normality of the underlying populations become more problematic when the sample sizes are not equal and the distributions are not similar in shape.  Therefore, we should try a transformation to reach (ideally) a more normal distribution. Remember that we make assumptions about the underlying POPULATIONS, not the samples. We use our sample data to examine whether our assumptions about the populations are valid.
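
If you want these comparisons side by side in one compact table, one possible approach is STATA's tabstat command. This is only a sketch and assumes the same variable names (population, climate) used elsewhere in this module.

```stata
* Compact comparison by signatory status:
* sample size, mean, standard deviation, median, and skewness
tabstat population, by(climate) statistics(n mean sd median skewness)
```

A mean well above the median, together with a clearly positive skewness, points to a right-skewed distribution, which is the situation in which a log transformation tends to help.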


Log Transformed Distributions


So I went ahead and log transformed the variable population using the following STATA command (in STATA, log() is the natural logarithm, the same as ln()):

STATA CODE:  generate logpop = log(population)
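
A quick sanity check after a log transformation can be worthwhile, because the log of zero or of a negative number is undefined and STATA stores a missing value in that case. The following is a minimal sketch, assuming the variable names used in this module; for city populations both counts should come back as zero.

```stata
* Values <= 0 have no logarithm; STATA would set logpop to missing for them
count if population <= 0

* Check that the transformation did not create missing values
* beyond any that were already present in population
count if missing(logpop) & !missing(population)
```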


I then developed box plots of the two distributions.

STATA CODE:  graph box logpop, by(climate)




While the box plots above do not suggest perfectly normally distributed data, these distributions are much closer to normal distributions than the untransformed data. We could test whether these transformed data are sampled from a normal distribution using normality tests such as the Shapiro-Wilk test.
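
As a sketch of how such a check might look, STATA's swilk command runs the Shapiro-Wilk test and qnorm draws a normal quantile plot. Running the test separately for each group is my choice here, since the assumption concerns each underlying population on its own.

```stata
* Shapiro-Wilk test of normality, run separately for each group
swilk logpop if climate == 0
swilk logpop if climate == 1

* Normal quantile plot for the signatory cities as a visual check
qnorm logpop if climate == 1
```

The null hypothesis of the Shapiro-Wilk test is that the data come from a normal distribution, so a small p-value is evidence AGAINST normality. With only 22 signatory cities the test has limited power, so it is best read alongside the plots rather than in place of them.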




Two Independent Sample, Unequal Variance Test


We now want to run our two independent sample, unequal variance test on our samples.

STATA CODE:   ttest logpop , by(climate) unequal



[Screenshot: STATA output of the ttest command on logpop, by climate, unequal variances]




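Because we used a log transformation, we need to be careful in interpreting the results of our comparison of means. First, we need to speak of the MEDIANS of the populations (not the means): if the log-transformed data are roughly symmetric, the mean of the logs is also the median of the logs, and back-transforming a median gives the median on the original scale. Second, we need to speak of multiplicative factors between populations, not differences, because a difference in logs corresponds to a ratio on the original scale.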

The two independent sample t-test suggests that the median population in cities that are signatory to the US Conference of Mayors Climate Protection Agreement is greater than the median population in cities that are not signatory to this agreement (t=1.439, df=29.4, one-sided p=0.08).  The t-statistic is negative in the STATA output because STATA sets up the difference as mean(0) (non-signatory) – mean(1) (signatory). Our hypotheses were established in the reverse order [mean(1) – mean(0)], and therefore we need to switch the sign on the t-statistic.  This result is marginally statistically significant (p=0.08). In other words, there is marginal evidence that the median population of signatory cities is larger than the median population of non-signatory cities.
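
The sign switch and the one-sided p-value can also be pulled directly from the results that ttest leaves behind in r(). This is a sketch for a do-file, assuming climate is coded 0/1 so that STATA treats the non-signatory cities as the first group.

```stata
quietly ttest logpop, by(climate) unequal

* STATA computes mean(0) - mean(1); flip the sign to match mean(1) - mean(0)
display "t for mean(1) - mean(0): " -r(t)

* Our Ha is mean(1) > mean(0), which is STATA's "diff < 0",
* so the relevant p-value is the lower-tail one
display "one-sided p-value: " r(p_l)

* Satterthwaite degrees of freedom used by the unequal option
display "df: " r(df_t)
```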


It is estimated that the median population of signatory cities is 1.34 times the median population of non-signatory cities (1.34 = exp(12.44521 – 12.15561)).  The 95% confidence interval around this multiplicative factor is (0.885, 2.02).  We calculate this confidence interval by exponentiating the endpoints of the confidence interval for the difference in mean logs: [exp(-0.121655), exp(0.700842)].
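
The back-transformation can be reproduced in a do-file from the stored ttest results. The sketch below is one way to do it; the scalar names logdiff and tcrit are my own, and the final three lines simply redo the arithmetic with the rounded values quoted in the text.

```stata
quietly ttest logpop, by(climate) unequal

* Difference in mean logs: signatory (group 2) minus non-signatory (group 1)
scalar logdiff = r(mu_2) - r(mu_1)

* Two-sided 95% critical value on the Satterthwaite degrees of freedom
scalar tcrit = invttail(r(df_t), 0.025)

* Back-transform the estimate and the interval endpoints to ratios of medians
display "ratio of medians: " exp(scalar(logdiff))
display "95% CI: " exp(scalar(logdiff) - scalar(tcrit)*r(se)) " to " exp(scalar(logdiff) + scalar(tcrit)*r(se))

* Arithmetic check using the rounded values reported in the text
display exp(12.44521 - 12.15561)    // about 1.336
display exp(-0.121655)              // about 0.885
display exp(0.700842)               // about 2.015
```

Note that this is a two-sided interval, while the hypothesis test above is one-sided, so the interval containing 1 is consistent with a one-sided p-value of 0.08.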



PHOTO CREDIT: Margaret Louey, PhD, Belize.



