Channel: Statalist

gen new var = betas for factor variable with >2 levels

Hi,
I have a variable "make" with 82 values. I used the commands:

egen dmake = group(make), label
reg price i.dmake

This created a factor variable and ran the regression just as I wanted, giving each level of dmake (except the base level) a beta coefficient. Now I want to add to my original data a column containing the beta corresponding to each observation's value of dmake.

Can I do that?
Thank you for your help!
Olga
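
One possible sketch (assuming the regression has just been run, so the _b[] coefficients are still in memory; the base level's coefficient is returned as 0):

Code:
reg price i.dmake
gen beta = .
levelsof dmake, local(levels)
foreach l of local levels {
    replace beta = _b[`l'.dmake] if dmake == `l'
}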

Compare coefficients of two Fama MacBeth Regressions

I am running Fama-MacBeth regressions on two subsamples and trying to compare the coefficients of the two regressions. I first use "est store" to store the two sets of estimates, and then use "suest" to test whether the two coefficients are equal. However, Stata reported the error: regression was estimated with a nonstandard vce (Fama-MacBeth).

Is this because Fama-MacBeth regression is not compatible with "suest"? I tried both xtfmb and asreg for the FM regression. If so, is there an easy way to compare the two coefficients from two Fama-MacBeth regressions? Thank you!
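
Since suest rejects the nonstandard VCE, one common workaround is a simple z-test on the difference of the two coefficients using their reported standard errors. A sketch only (y, x, and the subsample indicator sub are placeholders):

Code:
asreg y x if sub == 1, fmb
scalar b1 = _b[x]
scalar se1 = _se[x]
asreg y x if sub == 0, fmb
scalar b2 = _b[x]
scalar se2 = _se[x]
di "z = " (b1 - b2) / sqrt(se1^2 + se2^2)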

Rolling Averages with dynamic window size (rangestat)

Hi everyone!

I have panel data: for each game_id (e.g. 10), I have daily prices over a period. The length of the period varies strongly, from about 6 months to 5 years.

I am trying to calculate a moving average for the daily prices for each game_id (e.g. 10).

I tried the following:
Code:
rangestat (mean) price, interval(date -180 180) by(game_id)
This works fine, but obviously uses the same rolling interval (window size) for all game_ids.
However, for game_ids where I have smaller observed periods (meaning less days), a window size of [-180,180] is too large.
Is there any way to make the interval boundaries depend on the length of the period (the number of days for that game_id), so that games with a smaller period length get a smaller window size?

Maybe a loop that iterates through each group (game_id), first checks the number of days present and then uses the number of days to adapt the interval for that game_id?


Here are some exemplary rows:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long game_id float(date price)
10 20054 3.74
10 20055 4.99
10 20056 4.99
10 20057 4.99
10 20058 4.99
10 20059 9.99
10 20060 9.99
10 20061 9.99
10 20062 9.99
10 20063 9.99
10 20064 9.99
10 20065 9.99
10 20066 9.99
10 20067 9.99
10 20068 9.99
10 20069 9.99
10 20070 9.99
10 20071 9.99
10 20072 9.99
10 20073 9.99

...

22 20054 2.20
22 20055 5.30
22 20056 5.30
22 20057 6.20
22 20058 6.20
22 20059 2.20
22 20060 1.30

end
format %td date

I am also open to other approaches if rangestat is not the way to go here, although it takes care of a lot of cases that would be a pain to handle manually.

Thank you very much in advance!
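
If I recall correctly, rangestat accepts variable names as interval bounds, so the half-window can be computed per group first. A sketch (the quarter-of-span rule is arbitrary and only illustrates the idea):

Code:
bysort game_id (date): gen span = date[_N] - date[1]
gen high = min(180, floor(span/4))
gen low = -high
rangestat (mean) price, interval(date low high) by(game_id)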

Using Margins After Mixed for Random Slope models

Dear statalist,

I'm trying to generate predicted marginal effects of a treatment variable (treat) after estimating a random-effects model with random coefficients for the treatment. The treatment variable is nested within countries, so the models include random intercepts for country. I interact the treatment with a country-level covariate (gdp), so the model has the following equation:

mixed outcome i.treat##c.gdp || country: gdp

My question is how to plot the treatment effect at various levels of gdp while excluding the random effects. I want to plot only the fixed-effect part of the treatment effect, which would reflect the interaction coefficient from the regression table. I used the margins command, but it does not seem to be doing that. I also tried adding the option predict(mu fixedonly), but it did not work.

I would appreciate your suggestions.
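
For what it is worth, after mixed the linear-prediction option for margins uses the fixed portion only, so something like the following sketch (the gdp values are placeholders) may already do what is wanted:

Code:
mixed outcome i.treat##c.gdp || country: gdp
margins, dydx(treat) at(gdp = (0(2)10)) predict(xb)
marginsplot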

logistic regression model

Hello,

I'm building a logistic regression model. What should I check after fitting a first logistic regression model?

I have

Number of obs   =        197
LR chi2(7)      =      32.62
Prob > chi2     =     0.0000
Pseudo R2       =     0.1321
Log likelihood  = -107.15331


                             Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------------+----------------------------------------------------------------
Variable 1 (continuous)      1.006296     .0110778    0.57   0.569     .9848168    1.028244
Variable 2 (binary 0/1)      2.526395     .8480382    2.76   0.006     1.308511    4.877813
Variable 3 (binary 0/1)      1.378342     .4699881    0.94   0.347     .7065026    2.689059
Variable 4 (binary 0/1)      2.343487     .9733331    2.05   0.040     1.038314    5.289282
Variable 5 (binary 0/1)      .6152974     .2070272   -1.44   0.149     .3181881    1.189834
Variable 6 (1/2/3; base 1)
  (2)                        4.020328     1.803574    3.10   0.002     1.668786    9.685502
  (3)                        2.216634     1.287652    1.37   0.171     .7099492    6.920869
_cons                        .6471313     .3573538   -0.79   0.431     .2192545    1.910013

Q: How do I decide whether this is the final model (e.g., assessing the scale of the continuous risk factors or examining the final fit of the model)?

Thank you very much,
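
A typical post-estimation checklist after a first logit model looks something like the sketch below (y and the x* names are placeholders for the actual variables):

Code:
logit y x1 i.x2 i.x3 i.x4 i.x5 i.x6
linktest                 // specification test on the linear predictor
estat gof, group(10)     // Hosmer-Lemeshow goodness-of-fit test
lroc                     // area under the ROC curve
* check the scale of the continuous predictor, e.g. with fractional polynomials:
fp <x1>: logit y <x1> i.x2 i.x3 i.x4 i.x5 i.x6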


How to randomly select one member from a household and use them in the panel data?


Hi! As a novice Stata user, I am so glad that I found this forum, and I sincerely appreciate any input or help in advance. I am using longitudinal (panel) data with 12 waves. I am trying to randomly choose one member of each household and use that member's 12 waves of data in my longitudinal analyses, even though one or more people from the same household may be included in the survey. I was able to randomly select one member of the household using the syntax below, and got results formatted as shown.

<SYNTAX>
set seed 12345
gen double random = runiform()
bysort hhid (random) : gen byte select = _n == 1
sort hhidpn wave

**hhidpn is a unique id for participants, hhid is the household id, and pn is the person number, for your information**

<RESULTS AS simplified examples>

hhidpn wave hhid pn select
3010 1 3 10 0
3010 2 3 10 0
3010 3 3 10 0
3010 4 3 10 1
3010 5 3 10 0
3010 6 3 10 0
3010 7 3 10 0
3010 8 3 10 0
3010 9 3 10 0
3010 10 3 10 0
3010 11 3 10 0
3010 12 3 10 0
3020 1 3 20 0
3020 2 3 20 0
3020 3 3 20 0
3020 4 3 20 0
3020 5 3 20 0
3020 6 3 20 0
3020 7 3 20 0
3020 8 3 20 0
3020 9 3 20 0
3020 10 3 20 0
3020 11 3 20 0
3020 12 3 20 0


So, in this simplified example, even though both 3010 and 3020 (hhidpn) are from the same household (hhid 3), only hhidpn 3010 has been selected ("select = 1"), and I would like to use all of 3010's variables collected across the 12 waves for my analyses.

In this case, how can I keep and use all 12 waves of variables only for the randomly selected hhidpn (such as 3010) in my longitudinal data set?

It may be a simple one, but I am still confused even after looking up previous posts in the forum. Any advice would be appreciated!
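
One sketch that keeps every wave of the chosen person: draw the random number at the person level (one row per hhidpn) rather than the row level, tag one person per household, and merge the tags back:

Code:
set seed 12345
preserve
keep hhid hhidpn
duplicates drop
gen double random = runiform()
bysort hhid (random): keep if _n == 1
keep hhidpn
tempfile picks
save `picks'
restore
merge m:1 hhidpn using `picks', keep(match) nogenerate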

Harmonic regression for seasonality analysis

Hi,
I have collected data on water quality parameters (total coliform, faecal coliform, pH, TDS, and NO3), plus temperature and rainfall, at two-month intervals over a year. I want to see the seasonal effect of temperature and rainfall on the water quality parameters. Harmonic regression is widely used for this kind of seasonality analysis, but I did not find a dedicated harmonic regression command in Stata.
Q1. How do I run a harmonic (trigonometric) regression in Stata with my existing data structure?
Q2. Do these two regressions give the same results?
Thanks.
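
In the absence of a dedicated command, harmonic regression is usually done by hand: generate sine and cosine terms for the seasonal cycle and include them in regress. A sketch, assuming six bimonthly observations per yearly cycle and placeholder variable names:

Code:
gen t = _n                    // time index (assumes data sorted by date)
gen sin1 = sin(2*_pi*t/6)     // first harmonic: period of 6 obs = 1 year
gen cos1 = cos(2*_pi*t/6)
regress coliform sin1 cos1 temperature rainfall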

IIA Test in Multinomial Logit Model

To test the independence of irrelevant alternatives (IIA) in an mlogit model, specifically with the Hausman IIA test, what should I focus on: the p-value or the chi2 statistic?
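
For reference, the usual mlogit-based Hausman IIA test compares the full model with one estimated after dropping an alternative, and the chi2 statistic and its p-value are read together. A sketch (y, x1, x2, and the dropped alternative code 3 are placeholders):

Code:
mlogit y x1 x2
estimates store full
mlogit y x1 x2 if y != 3
estimates store restricted
hausman restricted full, alleqs constant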

Combining Multiple Imputation & CV Lasso Logit

I am currently trying to build a model using cross-validated lasso logit regression (cvlassologit) on data subjected to 30 rounds of multiple imputation by chained equations. Based on my understanding, there are several potential methods for selecting the final model features (e.g. group penalty, stacking the imputed data sets prior to cvlassologit, or averaging features that appear in some percentage of the m imputed data sets). After some work, I have elected to pursue the last of these, calculating the mean of the selected coefficients that appear in at least half of the m imputed data sets.
Code:
* testing/training split
set seed 12345
cap drop train
g train = (runiform() > 0.5)

* MICE
mdesc mrimanage age sex bmi sports lumbar2 thoracic2 cervical2 traumahx suddenonset serious dosym backpain legpain armpain neckpain nightpain neurodeficit
mi set flong
mi register imputed bmi lumbar2 thoracic2 cervical2 traumahx suddenonset serious dosym backpain legpain armpain neckpain nightpain neurodeficit
mi impute chained (logit) lumbar2 thoracic2 cervical2 traumahx suddenonset serious backpain legpain armpain neckpain nightpain neurodeficit (pmm, knn(20)) bmi dosym , add(30) rseed(54321) augment

* Export coefficients to excel
forval c = 1/30 {
    cvlassologit mrimanage age sex bmi sports lumbar2 thoracic2 cervical2 traumahx suddenonset serious dosym backpain legpain armpain neckpain nightpain neurodeficit if !train & _mi_m == `c' , ///
        seed(123) lopt postresults
    estimates save pedscvlasso`c', append
}

local row 1
forval i = 1/30 {
    estimates use pedscvlasso`i'
    mat list e(b)
    putexcel A`row' = matrix(e(b)) using pedslassoest, modify
    local ++row
}
With this I was able to calculate the mean coefficients of 6 selected variables, though I am now struggling to find an appropriate way to return these values for postestimation purposes. I have tried the following:
1) Modify e(b) of an arbitrary cvlassologit. This didn't quite work, as the dimensions of e(b) were not large enough for the 6 selected coefficients plus the constant (this was true for all 30 imputed sets; the largest size was 5). I feel this might work if there were an acceptable matrix available to replace, though I am unsure whether this affects the postestimation commands that follow.
Code:
capture program drop lassoavgs
program define lassoavgs, eclass
   *Calls original data set lasso model to modify parameters as 
   cvlassologit mrimanage age sex bmi sports lumbar2 thoracic2 cervical2 traumahx suddenonset serious dosym backpain legpain armpain neckpain nightpain neurodeficit if !train, seed(123) lopt postresults
            
    mat a = e(b)
    
    *Replaces estimate values with averaged
    mat B = (0.095159) // age
    mat C = (0.286021) // lumbar2
    mat D = (0.462691) // suddenonset
    mat E = (0.517194) // legpain
    mat F = (0.688969) // nightpain
    mat G = (1.394133) // neurodeficit
    mat H = (-0.45125) // cons
 
    mat a[1,1] = B[1,1]
    mat a[1,2] = C[1,1]
    mat a[1,3] = D[1,1]
    mat a[1,4] = E[1,1]
    mat a[1,5] = F[1,1]
    mat a[1,6] = G[1,1]
    mat a[1,7] = H[1,1]
    
    *Renames columns as needed
    matrix colnames a = age lumbar2 suddenonset legpain nightpain neurodeficit _cons
    erepost b = a
end

mi convert flong, clear
lassoavgs
matrix list e(b)
2) Directly create a matrix containing the mean coefficient values. This produces results, though I am uncertain whether creating a custom e(b) will lead to appropriate postestimation results (as opposed to using the e(b) output from cvlassologit).
Code:
 
* Create matrix of coefficients
    matrix input L = (0.095159, 0.286021, 0.462691, 0.517194, 0.688969, 1.394133, -0.45125)
    matrix colnames L = age lumbar2 suddenonset legpain nightpain neurodeficit _cons
    ereturn post L

* Predict in training using xb & convert to probability
    cap drop xb1
    predict double xb1 if !train & _mi_m == 0
    cap drop xb2
    gen xb2 = exp(xb1)/(1+exp(xb1)) if !train &  _mi_m == 0

* ROC, cutpoint, calibration- training
    roctab mrimanage xb2 if !train & _mi_m == 0, graph summary
    cutpt mrimanage xb2 if !train & _mi_m == 0, youden noadjust
    pmcalplot xb2 mrimanage if !train & _mi_m == 0
Is approach 1) or 2) valid in this scenario? If 1), is there a way to modify the size and content of e(b) from cvlassologit for postestimation? If 2), is the above code correct?

Thank you for your help in advance!

Calculate expected value t+1 based on stored estimated coefficients

Hi,

I would like to generate a new variable that forecasts the expected return at time t+1 by company, based on estimated coefficients (from a Fama-MacBeth procedure, but the principle is the same as for basic OLS estimation). As a result, I would like to extend the sample by one period for each company to record the forecasted expected return.

I do not know how to combine the independent variables with the estimated coefficients to generate the t+1 forecast in Stata. I need the t+1 forecast recorded in a generated variable because I then want to sort companies based on it. Can someone help? Thanks a lot!

Here is what I have so far (assuming that "invest" is the variable to forecast at t+1):
Code:
clear all
webuse grunfeld

tsset company time
gen F1_invest=F1.invest

xtfmb F1_invest mvalue kstock, lag(2)
est store FMB_Newey
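
One way to turn the stored coefficients into a t+1 forecast is to compute the linear prediction by hand. A sketch (note that regressors dated t are combined with coefficients from the F1 regression, so the fitted value at t is the forecast for t+1; the last line records the end-of-sample forecast for sorting):

Code:
gen double fcast = _b[_cons] + _b[mvalue]*mvalue + _b[kstock]*kstock
* fcast in the last observed period is the expected invest at t+1
bysort company (time): gen double exp_invest_t1 = fcast[_N]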



Cubic age variable in linear regression

Hi all,

Upon visual inspection of my data, I noticed that my continuous variable, csh_sh, both increased and decreased with age. Using a bar chart comparing mean values across age categories, I observed what looks like a cubic function.

I therefore used the following commands which confirmed this:

Code:
regress csh_sh c.age##c.age##c.age
margins, at(age = (18(7)90))
marginsplot
[marginsplot graph attached]




I now have my regression model as csh_sh = b0 + b1*age + b2*age^2 + b3*age^3, excluding control variables for simplicity. The results from OLS estimation are as follows:

Code:
regress csh_sh c.age##c.age##c.age
---------------------------------------------------------------------------------
        cashshare |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+--------------------------------------------------------------
              age |  -.0256759   .0049533    -5.18   0.000     -.035386   -.0159658
      c.age#c.age |   .0005023   .0001001     5.02   0.000     .0003062    .0006985
c.age#c.age#c.age |  -3.11e-06   6.42e-07    -4.84   0.000    -4.37e-06   -1.85e-06
            _cons |   .6778854   .0771562     8.79   0.000     .5266348    .8291361


With regard to the interpretation of the coefficients (in bold), can anyone provide recommendations? I know that interpreting these coefficients is more complex than interpreting linear relationships.

When I include my control variables, all three coefficients are no longer statistically significant. In this case, I would argue no significant relationship is present. However, it would be useful to understand what the size of the coefficients actually means/implies.

Any advice/recommendation on this would be really appreciated. Thanks!
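
Rather than interpreting b1-b3 directly, it is often easier to report the marginal effect of age, dy/dx = b1 + 2*b2*age + 3*b3*age^2, at a few representative ages. A sketch:

Code:
regress csh_sh c.age##c.age##c.age
margins, dydx(age) at(age = (20 40 60 80))
marginsplot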

New to SSC xtheckmanfe - Panel data Fixed effect with selection and endogeneity

Dear all,
Thanks to Prof. Baum, a new command is available on SSC.
I call it xtheckmanfe. This command implements the estimator proposed by Wooldridge (1995) and Semykina and Wooldridge (2010).
This version of the command estimates standard errors based on a bootstrap procedure.
Best Regards
Fernando
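
For anyone wanting to try it, the usual SSC workflow applies:

Code:
ssc install xtheckmanfe, replace
help xtheckmanfe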

Using roctab and roccomp functions

Hi there,

Firstly, I just wanted to say thank you for taking time to read this post.

I have performed roctab on the independent variable x and the dependent variable z, and also on another independent variable y and the same dependent variable z. I want to overlay the two ROC curves. I have tried using roccomp, which has so far been unsuccessful. I believe this is because I have different numbers of observations for x and y: roccomp appears to use only the observations that have values for both x and y, leading to AUC values different from those calculated by roctab for each variable separately.

Is there a way I can overlay the roctab graphs without losing observations the way roccomp does?

Regards, your help is much appreciated.

Jimmy

Code:
roctab z x
roctab z y
roccomp z x y, graph summary

Generalized DiD model with multiple periods and groups

$
0
0
Hi all,

I am estimating the effect on infant mortality (imr) of a health policy implemented on a municipal level starting in 2013, with municipalities adopting the policy in different years. I have constructed panels for each municipality with data ranging from 2010-2017. A Generalized DiD seems to be the most fitting approach.

Following advice on a previous thread, I have run the regression as follows:
Code:
xtreg imr i.treat i.active_treat i.year Z, fe vce(cluster mun)
where treat = 1 if the municipality ever implemented the policy and = 0 otherwise; active_treat = 1 in all periods when the policy is active and = 0 in all periods before adoption, as well as = 0 in control municipalities in all periods; and Z is a vector of time-variant municipal characteristics, such as population, GDP, literacy rate, etc.

I have also created a period index variable, m, taking the value 0 in the period when the policy was adopted and the value -N/+N in the N periods before/after implementation.
My first question is how to incorporate the period index variable m (or perhaps only the post-policy values, call it m_post) into the regression to allow for continued effects.
Second, policy implementation differed across treated municipalities: some received doctors of type1, of type2, or of both. How do I incorporate this into the model to evaluate the impact of receiving type2 or both, compared with just type1? Is it sufficient to create a dummy variable and add it as a covariate, or should I restrict the sample to type1 only, run the regression, and re-run on samples with only type2 and with both?

Many thanks for the help!
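
One common sketch for letting the effect vary by event time: cap the index, shift it so it is non-negative (factor variables cannot take negative values), hold controls at the omitted category, and use the period before adoption as the base. The +/-4 cap is arbitrary and the variable names follow the post:

Code:
gen m_cap = max(min(m, 4), -4)     // bin distant leads/lags at +/-4
gen m_evt = m_cap + 4              // shift to 0..8; m = -1 becomes 3
replace m_evt = 3 if treat == 0    // controls stay in the omitted category
xtreg imr ib3.m_evt i.year Z, fe vce(cluster mun)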

How to use instrumental variables estimation?

Dear all,

For my thesis, I need to run the following regression (equation attached as an image).

In the paper my thesis is based on, it states: "Instrumental variables estimation is still required because perf(it) is used to calculate both delta(perf(it+1)) and delta(perf(it))."
For the moment, my regression in Stata takes the form shown in the attachment.

The performance measure is OperatingIncome. The rest of the variables are slightly different from the specification above.
My question is: what model should I use to deal with this? I have read that the command ivreg2 can deal with it. Can someone propose an instrumental variable?

Thanks in advance
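
As a purely illustrative sketch (the instrument choice must be justified economically; deeper lags of the performance variable are a common but contestable candidate, and all variable names here are placeholders):

Code:
ssc install ivreg2
tsset firm year
ivreg2 y (d_perf = L2.perf L3.perf) controls, robust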

Call for applications - PhD positions - Doctoral School of Social Sciences, University of Trento

The call for applications for 2020/2021 entry in the doctoral programmes in:

- Economics and Management (6 fully funded scholarships)
- Sociology and Social Research (5 fully funded scholarships)
- Sustainability: Economics, Environment, Management and Society - SUSTEEMS (5 fully funded scholarships)
is now open!
Deadline for applications: 3 June, 4 PM (Italian time)
Online application system: https://webapps.unitn.it/Apply/en/Web/Home/dott

Please circulate this to anyone who may be interested.
For any further clarification please do not hesitate to contact us.


Doctoral School of Social Sciences
Via Verdi 26 – 38122 Trento – IT
Phone: + 39 0461 283756/2290
Fax: + 39 0461 282335
Email: school.socialsciences@unitn.it

multiply coefficient in regression, 95% CI

Hi everyone,

I regress depVar on IndepVar. I get a coefficient of -.0977362 with a 95% confidence interval of (-.1286172, -.0668551). The exposure variable is measured in minutes. For interpretation, a one-minute difference in exposure is not a meaningful amount, so I am thinking of converting the variable to units of 10 minutes, so that I can interpret the coefficient as the change in the dependent variable associated with a 10-minute increase in exposure. A simple way is to multiply the coefficient by 10, but what about the confidence interval? It also needs to reflect the 10-minute change. Can I multiply it by 10 too?

Thank you.
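
Yes: multiplying a coefficient and both CI endpoints by the same positive constant is valid. In Stata this can be done directly with lincom, or by rescaling the variable (names are placeholders):

Code:
regress depVar IndepVar
lincom 10*IndepVar      // effect of a 10-minute difference, with its own CI
* equivalently by hand: -.0977362*10 = -.977362, CI (-1.286172, -.668551)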

Time dummies in Pooled OLS regression

Hi,

I am estimating a firm group-level variable using a panel data set.
I have added fixed effects for years, industries, and countries to account for unobserved effects, and used xtreg for the estimation.

However, my main independent variable is time-invariant, so it cannot be estimated within this model. I am therefore considering pooled OLS for this estimation.

Now, I am rather confused about the difference in the definitions of the two models.
Is a fixed-effects model including time and firm fixed effects the same as pooled OLS with time and firm dummies (and clustered errors)?


Assume the regression equation is Y = bX + error, where X is time-invariant and Y is time-variant.

Can I add time dummies to this, or would that violate the pooled OLS model?

best,
Frank
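
The contrast can be seen directly in a sketch (placeholder names): firm fixed effects absorb any time-invariant X, while pooled OLS with only time dummies leaves its coefficient identified.

Code:
xtset firm year
xtreg Y X i.year, fe vce(cluster firm)   // X is dropped: collinear with firm effects
regress Y X i.year, vce(cluster firm)    // pooled OLS with time dummies keeps X

Note that pooled OLS with firm dummies added would again absorb X, so the two specifications in your question are not equivalent for a time-invariant regressor.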

lclogitwtp2 r(498)

Dear all,

I have a question regarding the command lclogitwtp2:
why does lclogitwtp2, cost(price) post, combined with a constraint of the type constraint 1 [Class1]price = 0, report an error instead of calculating the ratios for Class2 only?



Willingness-to-pay (WTP) coefficients

--------------------------------
WTP for | Class1 Class2
-------------+------------------
ASC_1 | . 0.926
ASC_2 | . 0.349
ASC_3 | . -1.069
--------------------------------

Please wait: -nlcom- is calculating standard errors for the WTP coefficients.

expression (_b[Class1:ASC_1] / (-1 * _b[Class1:price])) evaluates to missing
r(498);

Is there a solution?

Thanks
Federica

Total Factor Productivity syntax with OLS, FE and LP

Hello Stata community,

I am calculating firms' TFP for the period 1990-2015 with an unbalanced panel, and I am trying to use OLS, FE, and LP.
For OLS, I think the syntax to obtain the value of TFP is clear:

regress ln_va ln_capital ln_labor, vce(cluster firm)
predict ln_TFP_OLS , resid
gen TFP_OLS=exp(ln_TFP_OLS)

But for FE and LP (Levinsohn and Petrin) I am not sure of the syntax.
For FE I start with:

xtreg ln_va ln_capital ln_labor, fe vce(cluster firm)

However, I do not know the predict syntax, and whether I need to take the exponential as well.

The same problem arises for LP, where I start with:

levpet ln_va, free(ln_labour) proxy(ln_intermediate_inputs) capital(ln_capital) valueadded reps(250)

But I do not know how to continue.

Does anyone know the complete syntax to obtain the value of TFP with FE and LP?
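
A sketch of one common convention for the remaining two estimators (for levpet, if I remember correctly, predict with the omega option returns estimated productivity directly; please verify against the help file):

Code:
* FE: productivity = firm effect + residual (one convention among several)
xtreg ln_va ln_capital ln_labor, fe vce(cluster firm)
predict double u_i, u            // firm fixed effect
predict double e_it, e           // idiosyncratic residual
gen TFP_FE = exp(u_i + e_it)

* LP:
levpet ln_va, free(ln_labour) proxy(ln_intermediate_inputs) capital(ln_capital) valueadded reps(250)
predict TFP_LP, omega            // estimated productivity in levels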
Thank you very much!

Guido

