Dear Statalist,
I am aware that this is not an entirely new topic, but I still have some questions regarding the handling of factor analysis with missing values.
The context:
I am analyzing a small-sample microeconomic dataset on the determinants of individual life satisfaction. I set up a basic regression model that contains some 26 variables, most of them binary (they were originally asked on an ordinal scale). I have missing values in 5 of these variables, with missingness ranging from 1% to 23%. Instead of using multiple imputation, I decided to use data from another household survey on the same population to fill in the missing values for the variable with the highest share of missing values. Other than that, I decided to continue with complete case analysis.
In a second step, I want to add more control variables to analyze my main variables of interest. To do so, I want to run factor analysis on a set of questions I included in the survey. Unfortunately, there is a small portion of missing values in 3 of the 12 variables (0.5-5% missingness) that I wanted to use for factor analysis. As my sample size is already rather small (N=191), and considering statistical power, (further) estimation bias, and comparability of my basic model and the extended model, I am looking for a way to calculate factors while keeping the number of observations the same. I have run factor analysis on the complete cases and it seems like my initial questions worked well, i.e. I can find meaningful factors.
I have done some intensive reading on imputation methods these past weeks and I think I now grasp the basics, as well as the advantages and difficulties of that approach. I also understand that combining mi and factor analysis poses a number of technical and theoretical issues. I found the paper by Truxillo (2005) outlining an EM algorithm to deal with missing data in factor analysis, as well as the practical explanation on http://www.ats.ucla.edu/stat/stata/f...or_missing.htm. It seems straightforward to me that estimating the full variance-covariance matrix makes it possible to calculate factors that account for missing values. However, since this method does not do any actual imputation, it does not solve my problem as soon as I run my regression model. Or does it?
Also, people have suggested doing factor analysis on each of the imputed datasets and then pooling the results. Here, I am not even sure in theory how I would be supposed to pool the resulting loadings and extracted factor scores, and I can't seem to understand how to technically use the individual imputed datasets. If the -mi estimate- command is not to be combined with -factormat-, how would I be able to run factor analysis on the imputed datasets?
Can anyone think of an elegant way to solve this issue, i.e. keeping the number of observations constant and still doing factor analysis? Or do you think that I will have to make do with complete case analysis instead?
Best,
Laura