
Panel data: data structure and model question

Hello guys,

I posted something related to what I am going to ask here in a past post, but I think it was necessary to create a new post. I want to start by saying that this project is very complicated, and that this post relates more to the design part of the project. Any idea or suggestion will help, any!

I have a dataset containing Use of Force (UOF) incidents in a mental health facility. When the mental health staff have to use force against a patient, that incident is recorded. Details about the staff involved (there can be more than one staff member), the patients involved (there can be more than one patient), as well as "environmental" characteristics of the incident (for example: location/housing unit, shift, population) are recorded. Below is a sample of how the data looks for an incident (not all variables are included).

FORCE_ID  UOF_DATE    SHIFT    FACILITY  PERSON_TYPE  STAFF_ID  STAFF_TITLE  STAFF_EXP  STAFF_DOB   OFFICER_GENDER  PATIENT_NAME  PATIENT_DOB  PATIENT_RACE  PATIENT_SCORE  PREVIOUS_UOF
10045     ##########  3 TO 11  MMD       EMPLOYEE     1025      DOCTOR       4956       11/14/1977  F
10045     ##########  3 TO 11  MMD       EMPLOYEE     4584      ASSISTANT    5214       1/8/1985    F
10045     ##########  3 TO 11  MMD       INMATE                                                                     MARIA         6/16/1985    BLACK         5              15
10045     ##########  3 TO 11  MMD       INMATE                                                                     JENNIFER      5/10/1999    WHITE         20             1

The ambitious general goal of this project is to develop a model that will predict which mental health providers are more likely to use force. My first approach was to collapse the data, based on the officer and on the inmate, creating a variable with the count of UOF incidents for that year. For example, when I collapse the data to the officer level, each row represents an officer. I did the same thing for the inmate. Using these datasets, I just ran OLS regressions to look for potential predictors of the UOF variable (the count for the year as the DV). For example, I found that staff with higher titles have higher numbers of UOF in the year. Likewise, more experienced staff (years on the job) have fewer UOF. Some predictors for patients are the number of admissions in the past, etc. I also collapsed the data to the "incident" level just to run frequency tables and some crosstabs. Each row is a UOF incident. For example, I know that more UOF incidents occur in the morning shift compared to the night shift (since all patients are sleeping).
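For concreteness, a minimal sketch of that officer-level collapse step, with hypothetical lower-case variable names mirroring the columns above, and uof_date assumed to be a Stata date variable:

Code:
* one row per staff member per year, counting the incidents they appear in
use uof_incidents, clear                      // hypothetical file name
keep if person_type == "EMPLOYEE"
generate year = year(uof_date)
collapse (count) n_uof = force_id, by(staff_id year)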

Here are my questions: how can I combine the effects of both the officer and the inmate, and even the "environmental" characteristics of the incident? What statistical model should I use? Should I use a mixed effects model? How should the data be structured to conduct the analysis? If this data is not sufficient, what type of data should I collect? I have some ideas on what to do, but I would prefer to hear from you guys first.

Best,
Marvin

Propensity score matching - Differentiate impact by groups

Hi everyone,

My question concerns how to differentiate the results of an impact evaluation using propensity score matching (PSM).

Let’s assume a basic scheme. I want to evaluate the impact of a treatment (tr) on y (outcome).

First, I get the propensity score using:

. pscore tr x1 x2, pscore(ps) comsup // x1 & x2 are covariates

Skipping the balancing steps (just to keep my question simple), I then estimate the impact using:

. psmatch2 tr, outcome(y) pscore(ps) n(5)

I’m interested in showing the impact of the treatment by men and women.
Now my question. How do I do that in the previous commands? Is this the correct way?

. psmatch2 tr if sex=="woman", outcome(y) pscore(ps) n(5)
. psmatch2 tr if sex=="man", outcome(y) pscore(ps) n(5)
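A variant sometimes considered (a sketch only, mirroring the commands above, with ps_w and ps_m as new subgroup-specific score variables) is to re-estimate the propensity score within each subgroup before matching, so that the matching model itself is sex-specific:

. pscore tr x1 x2 if sex=="woman", pscore(ps_w) comsup
. psmatch2 tr if sex=="woman", outcome(y) pscore(ps_w) n(5)
. pscore tr x1 x2 if sex=="man", pscore(ps_m) comsup
. psmatch2 tr if sex=="man", outcome(y) pscore(ps_m) n(5)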

Thanks in advance,
Wil

Millisecond problem

Using Stata 14, up to date, on a Windows 10 system, I have data with year, month, day, hour, minute, second, and millisecond as separate float variables. I created a variable equal to second.millisecond and then used mdyhms() to create a datetime variable. It works much of the time, but intermittently gets it wrong. In the example below, it is correct for observations 1 to 4, but then loses 1 millisecond in observation 5. It gets later milliseconds right. The error is not always mistaking .001 for 0 - it also mistakes .003 for .002, etc.

Any assistance would be appreciated.

Phil

g double secondms = second + .001 * ms
g double datetime = mdyhms(month, day, year, hour, minute, secondms)
format datetime %20.5f
list minute second ms secondms datetime


     +--------------------------------------------------------+
     | minute   second   ms   secondms              datetime  |
     |--------------------------------------------------------|
  1. |      4       48   17     48.017   1549065888017.00000  |
  2. |      4       48   18     48.018   1549065888018.00000  |
  3. |      4       48   19     48.019   1549065888019.00000  |
  4. |      5        8    0          8   1549065908000.00000  |
  5. |      5        8    1      8.001   1549065908000.00000  |
     |--------------------------------------------------------|
  6. |      5       15    0         15   1549065915000.00000  |
  7. |      6        0    0          0   1549065960000.00000  |
     +--------------------------------------------------------+





Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(month day hour minute second ms year) double(secondms datetime)
2 1 0 4 48 17 2009 48.017 1549065888017
2 1 0 4 48 18 2009 48.018 1549065888018
2 1 0 4 48 19 2009 48.019 1549065888019
2 1 0 5  8  0 2009      8 1549065908000
2 1 0 5  8  1 2009  8.001 1549065908000
2 1 0 5 15  0 2009     15 1549065915000
2 1 0 6  0  0 2009      0 1549065960000
end
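The underlying issue is likely floating-point precision: .001 has no exact binary representation, so secondms can land a hair below the intended value and the millisecond is truncated inside mdyhms(). Since Stata %tc datetimes are just milliseconds since 01jan1960, one workaround (a sketch, not the only possible fix) is to build the datetime from whole seconds and then add the millisecond count directly:

Code:
* build the datetime from integer seconds, then add ms directly --
* ms is already in milliseconds, the native unit of %tc datetimes
generate double datetime2 = mdyhms(month, day, year, hour, minute, second) + ms
format datetime2 %20.0f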



Dealing with endogeneity of a dummy variable as a treatment effect

I am currently studying the public wage premium in Sri Lanka. I have been looking at the literature on switching regressions and on using an endogenous dummy variable model (1=public, 0=private) for wage employees. I then came across a paper by Wooldridge (2008), "Instrumental Variables Estimation of the Average Treatment Effect in the Correlated Random Coefficient Model", and was keen to apply it in my analysis. From this paper I have a vague idea in my head, but I am not sure whether I am on the right track or whether it is a feasible approach. I am hoping I can get some advice on modelling it.

Here is my approach, which I am sure is very flawed at present, so I apologize for that:

1. Estimate a probit (1=public, 0=private), or two probit regressions (for public and private employees separately). I am not sure which is more suitable.

probit public age age2 years_in_education gender ethnicity

2. Obtain the predicted probabilities

predict p_hat, pr

3. The second stage will use IV to estimate the wage function (where the dummy variable is endogenous)

ivregress 2sls log_wage age age2 years_in_education gender ethnicity (public=father_in_public_sector spouse_in_public_sector p_hat), robust first

If I estimate two probit regressions instead of one, then I would end up with two correction terms for my second-step IV (if I have understood it correctly).

Does this sound like a sensible approach, or have I completely misunderstood the concepts? Ideally, I wish to employ a switching regression model while controlling for endogeneity of sector choice.

Thank you for the help

Reference:
Wooldridge, J. (2008), "Instrumental Variables Estimation of the Average Treatment Effect in the Correlated Random Coefficient Model", Advances in Econometrics, 21, pp. 93-116

"not found in list of covariates"-error despite correct indicator variable specification

I have read http://www.statalist.org/forums/foru...293473-margins but the solution posted there (adding the i. prefix) does not help in my case.

Data and specification
xi: svy: ivprobit sw_participate i.male i.i_age (i.i_msg_read=avg_logins)

sw_participate and male are 0/1 indicators
i_age has 4 groups (min 0, max 3)
avg_logins is continuous

Ideally, I would like to run the following margins command:
margins i_msg_read, at(_Ii_age_1==1) at(_Ii_age_2==1) at(_Ii_age_3==1) baselevels vce(unconditional) asbalanced
'i_msg_read' not found in list of covariates

This error seems to have two dimensions, and it is not related to the instrumental variable (the same happens for age, male, etc.):

Question A: Why can I not calculate margins at specific points?
Even though I used the i. specification, margins without the dydx option produces error r(322).

E.g. does not work:
margins _Ii_msg_rea_1
margins _Ii_age_2
'_Ii_msg_rea_1' not found in list of covariates

E.g. works:
margins, dydx(_Ii_msg_rea_1)
margins, dydx(_Ii_msg_rea_1) at(_Ii_age_1==1) at(_Ii_age_2==1) at(_Ii_age_3==1) baselevels vce(unconditional) asbalanced
margins, dydx(_Ii_age_2)
Not very helpful, since the marginal-effect value of msg_read is of course the same (average) value at all levels of age.


Question B: Why do I have to enter the indicator variables created by xi?
Entering the original i.variable names yields the same error r(322).

E.g. does not work:
margins, dydx(i_msg_read) baselevels vce(unconditional) asbalanced
margins, dydx( i_age) baselevels vce(unconditional) asbalanced
'i_age' not found in list of covariates

E.g. works:
margins, dydx(_Ii_msg_rea_1) baselevels vce(unconditional) asbalanced
margins, dydx( _Ii_age_2) baselevels vce(unconditional) asbalanced
Not very helpful, since I am interested in _Ii_age_3 etc. as well. Do I really have to enter them all separately?


I am sure I have overlooked something obvious, but even after reading a lot, I cannot figure out what I missed.

Thank you for your help.

Loop over observations to open a large file

Dear statalists,

I am very new to Stata, and I would like to ask for your help because I cannot create a loop that repeatedly opens a very large dataset in smaller portions and saves each portion as a separate file, so that I can work with each part separately.
Suppose I have a large dataset file (dataset.dta) that I cannot open all at once because it is too large for my PC. I would like to:
1) open it in the range 1/1000
2) save the file as dataset_1.dta
3) close it.
4) open 1001/2000 and restart the process until the entire dataset has been saved.

My idea was the following, but I think that it is impossible to use `i' in the range:

forval i = 1/100 {
    use "C:\Stata\dataset.dta" in [(`i'-1)*1000+1]/(1000*`i')
    save "C:\Stata\dataset_`i'.dta"
}
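For reference, a sketch of one fix: the in range cannot contain expressions, but the bounds can be precomputed in local macros, and the use ... in ... using form reads only those observations from disk. This assumes the file has at least 100 x 1,000 observations; the final chunk's upper bound would need capping at the true number of observations, obtainable beforehand via describe using.

Code:
forvalues i = 1/100 {
    local first = (`i' - 1) * 1000 + 1     // lower bound of this chunk
    local last  = `i' * 1000               // upper bound of this chunk
    use in `first'/`last' using "C:\Stata\dataset.dta", clear
    save "C:\Stata\dataset_`i'.dta", replace
}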
Many thanks in advance for your help, and I am sorry if this is a silly question.

Paolo

likelihood ratio test for SSM (GLLAMM) model

Dear statalisters,

I have estimated an endogenous switching model with a binary outcome using the gllamm wrapper ssm.

The output of the model is:

Endogenous Switch Logit Regression
(Adaptive quadrature -- 15 points)

Number of obs = 18664
Wald chi2(6) = 2992.69
Log likelihood = -23488.907 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
werk12mnd | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
werk12mnd |
gesprek3mnd | .723094 .0644797 11.21 0.000 .5967162 .8494718
score_wv | .0301779 .0050612 5.96 0.000 .0202581 .0400976
lft | .0209485 .0030777 6.81 0.000 .0149162 .0269807
_cons | -2.92861 .4005486 -7.31 0.000 -3.713671 -2.143549
-------------+----------------------------------------------------------------
switch |
score_wv | -.0084279 .0007673 -10.98 0.000 -.0099318 -.0069239
lft | -.0190405 .001762 -10.81 0.000 -.022494 -.0155869
contgrp | -.1955344 .0250389 -7.81 0.000 -.2446097 -.1464591
_cons | .9732821 .1163154 8.37 0.000 .7453082 1.201256
-------------+----------------------------------------------------------------
rho | -.5406059 .1079286 -5.01 0.000 -.648605 -.0487247
------------------------------------------------------------------------------
Likelihood ratio test for rho=0: chi2(1)= 0.00 Prob>=chi2 = 1.000


rho is significant, but the likelihood ratio test is as far from significant as possible. How should I interpret these conflicting(?) results?

regards,

Marcel

Annoyingly coded ordinal independent variables

Surveys often include questions with options like "daily," "once a week," "a few times a month," ..., "once a year," "never," or something like that. I understand why questions are worded that way, but I find them annoying to deal with. The coding clearly isn't continuous or even roughly continuous. But I hate to just break the variable up into a bunch of dummies: you get a lot of variables that way, and you lose the fact that the categories are ordered.

What I often suggest is treating the variable as categorical, then treating it as continuous, and then doing a test to see whether it is OK to treat it as continuous, e.g. as in the sketch below.
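In Stata terms that check might look like this (a sketch with a hypothetical outcome y and an ordinal regressor freq coded 0-4):

Code:
regress y i.freq            // treated as categorical: one dummy per level
estimates store cat
regress y freq              // treated as continuous: a single linear slope
estimates store lin
lrtest cat lin              // does linearity fit as well as the full dummies?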

I am curious what other people do. I suspect a lot of times people just treat the variable as continuous. But are there other guidelines or suggestions on how to proceed?

Product of row elements

Dear all,

I have a matrix E = (1, 2, -3, -6 \ 3, 4, 1, 3).

How can I calculate the product of all elements in each row?

The results should be r1 = 1 x 2 x (-3) x (-6) = 36; r2 = 3 x 4 x 1 x 3 = 36.
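For what it's worth, here is one way in Mata (a sketch; I am not aware of a built-in row-product function), accumulating the product column by column:

Code:
mata:
E = (1, 2, -3, -6 \ 3, 4, 1, 3)
r = J(rows(E), 1, 1)              // start with a column of ones
for (j = 1; j <= cols(E); j++) {
    r = r :* E[., j]              // multiply in column j elementwise
}
r                                 // displays 36 \ 36
end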

Thanks.

correlation between two groups of variables

Hi everyone, I am working on data with five proxy variables (independent) and eight other variables (7 independent and 1 dependent), and I want to find the correlation of every proxy variable with all of the other eight variables. I have tried many options but I couldn't get what I want. Please help me with this situation.
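For illustration, a sketch with hypothetical variable names, since the actual names were not posted: proxies p1-p5, other independents x1-x7, dependent y. pwcorr prints a single correlation matrix, and the rows for the five proxies can then be read against the remaining columns:

Code:
pwcorr p1 p2 p3 p4 p5 x1 x2 x3 x4 x5 x6 x7 y, sig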

incomplete bar legend

Hey Stata experts,

I am having trouble identifying the problem with my code for a bar graph with six legend items. Specifically, I have seven racial groups on the x axis, and each group is supposed to show six disability types. In my code, I included six bar() options to represent the six disability types. When I hit do, Stata doesn't report any errors, but the graph's legend looks wrong: three of the six disability types are correctly shown with colors, but the colors of the other three legend items are missing. I don't know what's wrong with my code. Is this because the graph area is too small to show all six? Any suggestions, please?

Below is my code:

graph bar SLD SLI AUT, bargap(-30) bar(1, fcolor(dknavy) lcolor(dknavy)) ///
    bar(2, fcolor(dknavy*.7) lcolor(dknavy*.7)) bar(3, fcolor(maroon) lcolor(maroon)) ///
    bar(4, fcolor(maroon*.7) lcolor(maroon*.7)) bar(5, fcolor(dkorange) lcolor(dkorange)) ///
    bar(6, fcolor(dkorange*.7) lcolor(dkorange*.7)) ///
    blabel(bar, position(inside) color(white) format(%10.0f)) over(race, ///
    relabel(1 "AsianIndian" 2 "Chinese" 3 "Filipino" 4 "Hmong" 5 "Japanese" 6 "Korean" 7 "OA") ///
    label(labsize(medsmall))) title("Distribution of Disability Types", span) subtitle("by Race", span) ///
    ytitle("Percent", size(medsmall)) ylabel(0(20)100, labsize(medsmall) nogrid) ///
    ylabel(0(1.5)8, labsize(medsmall) nogrid) ///
    legend(order(1 "AUT" 2 "SLD" 3 "SLI" 4 "EBD" 5 "MUL" 6 "ID") ring(0) position(11) symxsize(2) symysize(2) rows(2) ///
    size(medsmall) region(lstyle(none) lcolor(none) color(none))) ///
    graphregion(color(white) fcolor(white) lcolor(white)) plotregion(color(white) ///
    fcolor(white) lcolor(white) margin(zero))

VAR residual heteroskedasticity test

Hello,

I wonder whether a program for a VAR residual heteroskedasticity test is available in Stata.

Thanks
Orhan

Sorting the results of a two-way table

Hello,

I am using the newest version of Stata (14.1), and I need help with getting a sorted two-way table. In order to get the table I wanted for my data set, I used the following code: "svy: tab county smoker2, row". This gave me the proportion of current smokers in each county in Missouri. However, the results are not in order, and there are a lot of counties in Missouri. For my assignment, I need the 5 counties with the lowest and the 5 with the highest smoking prevalence. Is there any way to both get the total proportion of smokers in each county (I used the "row" option for this) and have a sorted table in the results? I can't find the answer to this anywhere online. Thank you very much for your help.

Kelsey

mi estimate with subgroup analysis - no observations in some imputations error

Hi Statalisters -

I am running an analysis using survey commands under mi estimate, and cannot get my analysis to run. I keep getting the following error: "This is not allowed. To identify offending imputations, you can use mi xeq to run the command on individual imputations or you can reissue the command with mi estimate, noisily."

The code I am trying to run is below:
mi estimate: svy,subpop(if hiv_status==1 & gender==1): proportion retain, over(artph) noisily
mi estimate: no observations in some imputations
This is not allowed. To identify offending imputations, you can use mi xeq to run the command on individual imputations
or you can reissue the command with mi estimate, noisily

I have reviewed the imputed datasets and do not have any missing data. Initially I imputed only for the subgroup hiv_status==1, but I was concerned that there was a problem with this, so I then imputed for all individuals; I am still getting the same error when I try to run the subsequent analyses on the imputed datasets.

Has anyone seen this before, or have any ideas about how to rectify it?

Thanks so much,
Alison

Ivreg2 and robust endogeneity test

Dear all,
Does the endog option in ivreg2 produce a test similar to the Wooldridge (1995) score test (for robust standard errors)? I cannot find explicit information on this.

Thanks in advance

Factor Analysis Question

In factor analysis, the factors are the result of the analysis: Stata figures out that, for example, variables A, C, and E are related, and assigns factor weights.

Is it possible to do an analysis where I can tell Stata which variables I want to "force" into a factor and have it do the analysis (give me the eigenvalues and factor weights, and allow me to do rotations)? I have a theory that I want to test, and it seems like it would be convenient if I could construct factors rather than derive them. I hear you can do this in SAS, but I wanted to see if anyone has ideas on doing such an analysis and whether Stata can support it.
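What is described sounds like confirmatory factor analysis, which Stata's sem command can fit (Stata 12 and later). A minimal sketch, using hypothetical lower-case observed variables a, c, and e, since sem treats capitalized names not found in the data (like F below) as latent:

Code:
sem (F -> a c e), standardized   // force a, c, e onto a single latent factor F
estat gof, stats(all)            // fit statistics for the hypothesized structure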

Thanks.

Winsorize command in STATA

Hello everyone,
I would like to winsorize some of my independent variables. However, I was not able to find a command in Stata 12 that does this.
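For reference, a manual percentile-based sketch on a hypothetical variable x (the community-contributed winsor command from SSC automates this):

Code:
summarize x, detail
local lo = r(p1)                  // 1st percentile
local hi = r(p99)                 // 99th percentile
replace x = `lo' if x < `lo'
replace x = `hi' if x > `hi' & !missing(x)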
I would really appreciate any help.
Thank you.

Several questions: asclogit - clogit - mixlogit | dummy coding vs. effects coding | interaction effects | Hausman test

Dear all,

I would like to conduct a discrete choice analysis and prepared my data, but as I have never done that before, I have several questions which I could not answer for myself yet.
Sorry in advance for the long text, I tried to describe my problems in detail.

First of all some information about the data:
I have 325 respondents, each of whom made 16 choices (randomly drawn from a large set of different choice situations) among three alternatives each - leading to 15,600 observations in total.
The alternatives are unlabelled, meaning that the alternatives are defined by their different attribute levels, and choosing alternative 1 or 2 does not provide any information about the alternatives without looking at their attributes. Alternative 3 in each choice set is a so-called "opt-out" option (choosing to buy none of the two products offered) with all alternative-specific attributes set to zero.

The model should include case-specific regressors as well as alternative-specific regressors, so I would choose a conditional logit model (asclogit or clogit).

Question 1:
I created an alternative-specific constant for the opt-out option (1: opt-out, 0: alt1/alt2), as suggested by Arne Risa Hole in another thread. But I also saw examples for this in other studies in which the coding was done the other way round (0: opt-out, 1: alt1/alt2). Which one is more appropriate? I thought the first version makes more sense if I would like to assess the utility of not choosing any product. The opt-out option is quite relevant in my study, as in about 50% of all 5,200 choice situations, the opt-out option was chosen.

Question 2:
So far, I have defined those regressors, but I also want to estimate interaction effects between some of the case-specific regressors and some of the alternative-specific regressors.
As it is an unlabeled design, I don't want to interact the respondents' characteristics with the alternatives themselves, but with their attributes.
I have used dummy coding for the categorical variables so far, but read that effects coding might be better for my purposes (Bech & Gyrd-Hansen, 2005: "Effects coding in discrete choice experiments").
Now I am wondering how I would have to interpret interaction terms in the case of effects coding, as, for instance, if two variables have the value -1, multiplying them would result in 1, the same as multiplying two variables with the value 1.
The more general question is how to create the interaction variables at all, as I haven't done that yet. Do I have to create them as separate variables to be stored, or are they created as part of the model estimation by combining two variables in the command? (For instance, something like the sketch below.)
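A sketch with hypothetical names (choice marks the chosen alternative, price is an alternative-specific attribute, income is a case-specific characteristic, and cid identifies the choice situation); with Stata's factor-variable notation the interaction is built during estimation, so no separate product variable needs to be stored:

Code:
clogit choice c.price c.price#c.income, group(cid)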

Question 3:
I have already tried out asclogit to estimate the model, and the model shows different results for the case-specific regressors for each of the alternatives (except for the base alternative).
I am not sure how to interpret this information, as the design is unlabeled. Related to this, I found this answer by Arne Risa Hole: http://www.stata.com/statalist/archi.../msg00121.html
but I don't really know how to implement the interaction terms in asclogit.
Would it be more appropriate to use clogit in my case?

Question 4:
I read in Cameron & Trivedi (2009), "Microeconometrics Using Stata", that if "there is some clustering such as from repeated observations on the same individual", the vce(cluster clustvar) option should be used. I implemented that, but then got an error message when trying to conduct the Hausman test for my model. Why is that happening? Should I delete that option for performing the test and then add it again?

Question 5:
As I would like to run the Hausman test for the conditional logit model, to test whether I can use it or whether the IIA assumption is violated, I am wondering whether I need to specify the complete model with all interaction terms first, or whether I can leave the interaction terms out. If the Hausman test for the "simple" model with only main effects already leads to the conclusion that leaving out one of the alternatives leads to significantly different choice outcomes, I would stop there and move to another model (probably mixlogit?).

I would very much appreciate any ideas and help with regard to my questions. Probably some more will follow in the course of estimations...

Thanks a lot in advance!

Cordula

Coefficients of time dummies after Between Estimation in Panel

Dear Stata Listers,

I am using Stata 14 on Windows 7. I have panel data (country and time dimensions), and at this early stage of the analysis I am interested in estimating fixed effects models using
Code:
xtreg, fe
In order to get a quick idea about the between variation, however, I also ran a between model using
Code:
xtreg, be
accidentally including time dummies. I was very surprised to see that Stata did not omit the time dummies, but rather presented coefficient estimates. The between estimator regresses group means of y on group means of x, so the time variation is lost, and I do not understand how coefficients of time dummies can be obtained.

When using grunfeld.dta, I am not able to replicate the problem: the time dummies are omitted, as expected. It seems that something is seriously wrong with my data.

This is an example of what I am getting:
Code:
. tsset
       panel variable:  iso3num (strongly balanced)
        time variable:  year, 1960 to 2015
                delta:  5 units

. xtreg gdppcgr govcon gfcf i.year if year>=1980, be

Between regression (regression on group means)  Number of obs     =      1,180
Group variable: iso3num                         Number of groups  =        196

R-sq:                                           Obs per group:
     within  = 0.0349                                         min =          1
     between = 0.4441                                         avg =        6.0
     overall = 0.0583                                         max =          7

                                                F(8,187)          =      18.67
sd(u_i + avg(e_i.))=  1.521715                  Prob > F          =     0.0000

------------------------------------------------------------------------------
     gdppcgr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      govcon |  -.0511847   .0136953    -3.74   0.000    -.0782019   -.0241675
        gfcf |   .1398083   .0139374    10.03   0.000     .1123135     .167303
             |
        year |
       1985  |   .7817038   2.779749     0.28   0.779    -4.701993    6.265401
       1990  |  -1.363742   1.789255    -0.76   0.447    -4.893461    2.165977
       1995  |   5.053418   2.277406     2.22   0.028     .5607088    9.546127
       2000  |   9.651357   2.155398     4.48   0.000     5.399337    13.90338
       2005  |  -3.674339   2.286362    -1.61   0.110    -8.184715    .8360381
       2010  |   6.794743   1.713597     3.97   0.000     3.414276    10.17521
             |
       _cons |  -3.175652   1.380157    -2.30   0.022    -5.898331   -.4529731
------------------------------------------------------------------------------
I am thankful for any suggestions!
Many thanks,
Anna

export tabstat in MS WORD

Hi!

I need to export a tabstat table into MS Word. This is the code I used to create it:

tabstat A B C, statistics( mean min max ) columns(statistics)


How can I export the results into MS Word? I hope someone can help me.
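One route I know of (a sketch using the community-contributed estout package, installable with ssc install estout) is to post the tabstat results and write them to an RTF file that Word opens directly:

Code:
estpost tabstat A B C, statistics(mean min max) columns(statistics)
esttab using results.rtf, cells("mean min max") nomtitle nonumber noobs replace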

Thank you, Lea