Hi,
I'm struggling with the geometric mean computation in the following case.
I need to create composite indexes based on the geometric (row) mean of multiple variables. The indexes are composed of different numbers of variables, and the variables have different distributions.
I wrote code following these steps:
1) Standardization: generate "modified z-scores" based on the median absolute deviation (to minimize the impact of extreme values);
2) Log transformation: store the sign of each value, then log-transform abs(`var') + 1, so that the result is zero when `var' == 0;
3) Back-transformation: take the arithmetic row mean of the log-transformed variables, store its sign, exponentiate its absolute value, subtract 1, and restore the sign.
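In formula terms, as a sketch of how I read my own steps 2 and 3: the symbols z_j, t_j, m and k are only my notation here (z_j is a standardized value, k is the number of non-missing values in the row), and sgn(0) is taken as +1, as in the code.

```latex
\begin{align*}
t_j &= \operatorname{sgn}(z_j)\,\ln\bigl(1 + |z_j|\bigr), \qquad j = 1, \dots, k \\
m &= \frac{1}{k} \sum_{j=1}^{k} t_j \\
\text{index} &= \operatorname{sgn}(m)\,\bigl(e^{|m|} - 1\bigr)
\end{align*}
```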
The code is:
//Step 1 - standardization: compute "modified z-scores" (based on the median absolute deviation, to minimize the impact of extreme values)
Code:
foreach var of varlist v* {
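qui su `var', det
local med = r(p50)
tempvar absdev
gen double `absdev' = abs(`var' - `med') /*absolute deviation from the median*/
qui su `absdev', det /*r(p50) is now the median absolute deviation (MAD)*/
gen double `var'_zsco = 0.6745 * (`var' - `med') / r(p50)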
}
//Step 2 - logarithmic transformation
Code:
foreach var of varlist *zsco {
//store the sign of the values before the logarithmic transformation
gen s_`var' = .
replace s_`var' = -1 if `var' < 0 & `var' != .
replace s_`var' = 1 if `var' > 0 & `var' != .
replace s_`var' = 1 if `var' == 0 & `var' != . /*to avoid missing values when zsco == 0*/
//logarithmic transformation of `var', adding 1 so it returns zeros when `var' == 0
gen double i_`var' = ln(1+(abs(`var')))*s_`var'
}
//Step 3 - compute the arithmetic rowmean of the ln transformed variables and exponentiate it
Code:
egen double i_Mean = rmean(i_*)
foreach var of varlist i_Mean {
//store the sign of the values of var
gen s_`var' = .
replace s_`var' = -1 if `var' < 0 & `var' != .
replace s_`var' = 1 if `var' > 0 & `var' != .
replace s_`var' = 1 if `var' == 0 & `var' != .
// exponentiate the arithmetic mean
gen double exp_`var' = (exp(abs(`var')))-1
//restore the sign of var values
replace exp_`var' = s_`var'*exp_`var'
}
I created an independent check for rows with positive z scores only (as the gmean() function for egen in egenmore (SSC) ignores zeros and negatives).
Taking for granted that step 1 is irrelevant to the actual problem, I simulated steps 2 and 3 on an earlier example provided by Nick (
https://www.statalist.org/forums/for...62#post1360962)
The values my code generates are very close to the expected ones, but they are not an exact match (the correlation is .9948), and I just can't find where my mistake is.
All the values I get from my own steps 2 and 3 are slightly higher than the expected values.
//Generating example data
Code:
clear
set obs 10
set seed 2803
forval j = 1/5 {
gen y`j' = ceil(100 * (runiform()^2))
}
list
+-------------------------+
| y1 y2 y3 y4 y5 |
|-------------------------|
1. | 86 63 45 8 1 |
2. | 12 40 73 100 4 |
3. | 60 1 74 61 4 |
4. | 2 1 4 2 54 |
5. | 12 1 22 22 4 |
|-------------------------|
6. | 1 7 15 84 14 |
7. | 4 1 12 94 7 |
8. | 40 2 15 2 89 |
9. | 16 34 25 7 6 |
10. | 15 6 3 44 6 |
+-------------------------+
//Generating expected gmean values
Code:
gen double M1 = y1
quietly forval j = 2/5 {
replace M1 = M1 * y`j'
}
replace M1 = exp(log(M1)/5)
list
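For reference, M1 is meant to be the usual geometric mean of the five (strictly positive) values in each row. Written out as a formula, in my own notation:

```latex
\[
M_1 = \Bigl(\prod_{j=1}^{5} y_j\Bigr)^{1/5}
    = \exp\Bigl(\tfrac{1}{5}\,\ln \prod_{j=1}^{5} y_j\Bigr)
    = \exp\Bigl(\tfrac{1}{5} \sum_{j=1}^{5} \ln y_j\Bigr)
\]
```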
//independent check 2 proposed by Nick
Code:
matrix test = (86, 63, 45, 8, 1)
gen test = test[1, _n]
means test
egen gmean = mean(ln(test))
replace gmean = exp(gmean)
means test
Variable | Type Obs Mean [95% Conf. Interval]
-------------+---------------------------------------------------------------
test | Arithmetic 5 40.6 -4.225618 85.42562
| Geometric 5 18.11458 1.794746 182.8326
| Harmonic 5 4.256322 . .
-----------------------------------------------------------------------------
Missing values in confidence intervals for harmonic mean indicate
that confidence interval is undefined for corresponding variables.
Consult Reference Manual for details.
//Applying my syntax
//Step 2 - log transformation
Code:
foreach var of varlist y* {
//store the sign of the values before the log transformation
gen s_`var' = .
replace s_`var' = -1 if `var' < 0 & `var' != .
replace s_`var' = 1 if `var' > 0 & `var' != .
replace s_`var' = 1 if `var' == 0 & `var' != . /*to avoid missing values when var == 0*/
//log transformation of `var', adding 1 so it returns zeros when `var' == 0
gen double i_`var' = ln(1+(abs(`var')))*s_`var'
}
//Step 3 - compute the arithmetic rowmean of the ln transformed variables and exponentiate it
Code:
egen double i_Mean = rmean(i_*)
foreach var of varlist i_Mean {
//store the sign of the values of var
gen s_`var' = .
replace s_`var' = -1 if `var' < 0 & `var' != .
replace s_`var' = 1 if `var' > 0 & `var' != .
replace s_`var' = 1 if `var' == 0 & `var' != . /*to avoid missing values when var == 0*/
// exponentiate the arithmetic mean
gen double exp_`var' = exp(abs(`var'))-1
//restore the sign of var values
replace exp_`var' = s_`var'*exp_`var'
}
list y1 y2 y3 y4 y5 M1 exp_i_Mean
+-------------------------------------------------+
| y1 y2 y3 y4 y5 M1 exp_i_M~n |
|-------------------------------------------------|
1. | 86 63 45 8 1 18.114581 20.515226 |
2. | 12 40 73 100 4 26.873536 27.83036 |
3. | 60 1 74 61 4 16.104771 18.52345 |
4. | 2 1 4 2 54 3.8663641 4.4817729 |
5. | 12 1 22 22 4 7.4682237 8.2785434 |
|-------------------------------------------------|
6. | 1 7 15 84 14 10.430841 11.669224 |
7. | 4 1 12 94 7 7.9413333 8.975884 |
8. | 40 2 15 2 89 11.639123 12.966184 |
9. | 16 34 25 7 6 14.169602 14.40053 |
10. | 15 6 3 44 6 9.3453063 9.713163 |
+-------------------------------------------------+
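As a hand check on row 1 (only arithmetic on the values listed above, with rounded results), the two columns correspond to the following. The second line applies steps 2 and 3 as coded, so each value is shifted by 1 before the log.

```latex
\begin{align*}
M_1 &= \exp\Bigl(\tfrac{1}{5}\,(\ln 86 + \ln 63 + \ln 45 + \ln 8 + \ln 1)\Bigr) \approx 18.1146 \\
\text{exp\_i\_Mean} &= \exp\Bigl(\tfrac{1}{5}\,(\ln 87 + \ln 64 + \ln 46 + \ln 9 + \ln 2)\Bigr) - 1 \approx 20.5152
\end{align*}
```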
Any help figuring out where my mistake is would be much appreciated!
Best,
Martin