
How do I perform VIF test after xtreg

Dear Experts, I have run the following
xtreg x fdi l k m lc, fe vce(robust)
Is there a need to perform a VIF test? If yes, how, please? Thank you.
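For reference, a hedged sketch of one common workaround: estat vif is available after regress but not after xtreg, so the collinearity check is often run on the pooled version of the same specification (variable names taken from the command above).
Code:
* pooled counterpart of the fixed-effects model, used only to inspect VIFs
regress x fdi l k m lc
estat vif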

Question about the effect of adding control variables on estimate

Dear all,

I am doing research using weak instrumental variables, and the main problem is that we are not sure which result should be chosen. We found that the instruments are weak, so we decided to use the CLR, K, and AR tests to obtain weak-identification-robust inferences. The null hypothesis of these tests is that beta is equal to zero. Our estimate of beta is around 0.06.
But we found that if we add two particular control variables, we accept the null hypothesis, whereas if we drop these two control variables, we reject it.
We found that the two control variables are not related to the explanatory variables. Can we therefore say that these control variables dilute the strength of the estimate of beta, which is already very small? Or are there other explanations for this case?
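For concreteness, a minimal sketch of how such weak-identification-robust tests are often obtained in Stata, assuming the community-contributed weakiv package and hypothetical variable names (y, x_endog, z1, z2, and the two controls c1, c2):
Code:
* ssc install weakiv           // if not already installed
ivregress 2sls y c1 c2 (x_endog = z1 z2), vce(robust)
weakiv                         // reports AR, K, and CLR weak-identification-robust tests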

Thank you.

Result of not adding the two unrelated control variables: [regression output attached as image]

Result of adding the two unrelated control variables: [regression output attached as image]

Drop variables with common suffix under certain conditions

Dear all,

I am facing difficulties in dropping some variables conditional on the missing observations of one variable.
I have a dataset of prices, volumes, and a measure of liquidity for 200 stocks. The dataset is a time series made of 601 variables: date, p1-p200, vo1-vo200, and liq1-liq200.
I want to drop p*, vo*, and liq* if the corresponding liq* has fewer than 30 observations.
I have tried
Code:
foreach var of varlist * {   
    qui count if missing(`var')     
    if r(N) >= 7535 drop `var'   
}
But there is a big problem with this approach: if, say, p1 has fewer than 30 observations but vo1 does not, it drops p1 and keeps vo1.
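One possible alternative, sketched under the assumption that the suffixes run 1-200 and that "fewer than 30 observations" means fewer than 30 non-missing values of liq, is to loop over the stock index so the three prefixes are dropped together:
Code:
forvalues i = 1/200 {
    quietly count if !missing(liq`i')
    if r(N) < 30 {
        drop p`i' vo`i' liq`i'
    }
}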

Any suggestion?

Best

Stefano

Oaxaca help - mean wage generated does not match the one given by sum command

Good afternoon,

I am using the oaxaca command to decompose the ethnic pay gap; in the example here I am comparing the pay of Bangladeshi men to that of White British men. The code I used is

oaxaca LogHOURPAY varlist, by(Bangladeshi) pooled categorical(varlist) eform

I have been using Jann's guide (https://core.ac.uk/download/pdf/6442665.pdf), which has the following on page 18:

So it appears this value should give the mean wage for the given group. However, when I do this I get the following mean wages for the groups: £13.37 for White British men and £9.02 for Bangladeshi men.

However, when I use the sum command, mean hourly pay is £15.74 and £10.40 for White British and Bangladeshi men respectively.


I'm using the same data and the number of observations is the same, so I am uncertain where this discrepancy could come from.
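One hedged check worth sketching: with the eform option, the reported group means are exponentiated means of LogHOURPAY (geometric means), which are generally below the arithmetic means that sum reports on the raw pay variable. Assuming a raw hourly pay variable (here hypothetically called HOURPAY) and that Bangladeshi == 0 marks White British men:
Code:
summarize HOURPAY if Bangladeshi == 0          // arithmetic mean, as reported by -sum-
quietly summarize LogHOURPAY if Bangladeshi == 0
display exp(r(mean))                           // geometric mean, what the eform output shows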

Thanks so much for any help provided, have a great day!

Issues with attempts to drop observations (r109 type mismatch error)

Hi there,

I'm new to Stata and attempting to drop some observations from my sample; however, I keep encountering the r(109) type mismatch error. I would greatly appreciate any help that any of you kind users may be able to offer.

The variable I'm working with reflects the percentage of shares acquired through an M&A deal, and the values are displayed as numbers to four decimal places, but the type and format reported by Stata are str9 and %9s respectively.

I wish to drop all observations with a value below 51%, so following the instructions here I use the following code:

Code:
drop if me_shares_acq_pct < 51
However, I keep encountering the r(109) error. I understand this is likely because the variable is stored as a string rather than a number, but I'm unsure how to convert it so that this comparison can go ahead unimpeded.
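A minimal sketch of one likely fix, assuming the string values contain only digits and a decimal point (no percent signs or other characters):
Code:
destring me_shares_acq_pct, replace
drop if me_shares_acq_pct < 51
* if the values contain stray characters such as %, add ignore(), e.g.
* destring me_shares_acq_pct, replace ignore("%")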

Please let me know if you require any further information regarding the issue.

Fractional Proportional Models and Controlling for Time.

Hi Statalist,
I have a dataset with a large number of observations and a small T. The years for my dataset are 2000-2016.
My DV is a proportion. I tried to really dissect some of Jeffrey Wooldridge's work last week regarding fractional proportional models, especially with regard to the issue of large N and small T (e.g. Papke and Wooldridge 2008, and Professor Wooldridge's presentation: https://www.stata.com/meeting/chicag...wooldridge.pdf).
I'm not sure if what I'm doing is correct. My unit of analysis is the directed-dyad year (StateA-StateB year1, StateB-StateA year1). I made a variable called "dyad_dir" that counts each directed dyad separately (e.g., dyad "1" for StateA-StateB, dyad "2" for StateB-StateA, dyad "3" for StateA-StateC, dyad "4" for StateC-StateA, etc.). I set the panel as dyad_dir year.



My two main independent variables (IVs) are binary (0,1). One variable has a more limited time frame, 2000-2009.
I have a lot of other control variables. My main IVs are x1 and x2.

So far, the panel data shows as strongly balanced. I'm assuming I will still have to use time average variables and control for the number of time periods available for each cross-sectional unit given the small T, and this is where my uncertainty lies.

I created time dummies for all years (2000-2016), but when I run the regression with the time dummy variables y00-y16, some years are omitted while others are not.

I tried to create time averages for the main independent variables. (I am not sure if I have to do it for the control variables).
I did the following, as per the files from the Stata meeting cited above, but I'm not sure if it's correct:
egen x1b = mean(x1), by(dyad_dir)
egen x2b = mean(x2), by(dyad_dir)

I tried to create a variable to control for the number of years available, but the variable was omitted because of collinearity. I also tried to play around with the regression (e.g., not including the year dummies or the time averages), and it was still omitted.
egen tobs = sum(1), by(dyad_dir)
gen tobs17 = (tobs == 17)
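As an aside, a hedged sketch of the same idea using factor-variable notation for the year dummies instead of the hand-made y00-y16 (other controls omitted here for brevity; names as in the post):
Code:
xtset dyad_dir year
xtgee y x1 x2 x1b x2b i.year, family(binomial 1) link(probit) corr(exchangeable) vce(robust)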

Data seen below from dataex:
Code:
input float(year y dyad_dir) byte x1 float(x2 x1b x2b x3 x4) byte x5 float(x6 x7 x8 x9) byte x10 float(x11 x12 y00 y01 y02 y03 y04 y05 y06 y07 y08 y09 y10 y11 y12 y13 y14 y15 y16 tobs17)
2000          . 1 0 1 0 1 1 0 2         1 1 1 .031967163 1 0 1.8011683 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2001         .5 1 0 1 0 1 1 0 2         1 1 1 .024324417 1 0 1.8102077 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2002          0 1 0 1 0 1 1 0 .       1.5 1 1 .012064934 1 0 1.6868168 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2003   .3636364 1 0 1 0 1 1 0 .         1 1 1 .023251534 1 0 1.5599297 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2004   .3333333 1 0 1 0 1 1 0 .         1 1 1   .0307827 1 0  1.437317 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
2005   .5555556 1 0 1 0 1 1 0 .         1 1 1 .032816887 1 0 1.2836504 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
2006   .3076923 1 0 1 0 1 1 0 .         1 1 1  .03156376 1 0  2.826926 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
2007         .4 1 0 1 0 1 1 0 .       1.5 1 1  .02896404 1 0  .9335653 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
2008  .14285713 1 0 1 0 1 1 0 .       1.5 1 1 .017453194 1 0  .9192817 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
2009          0 1 0 1 0 1 1 0 .         1 1 1  .02192688 1 0  .8979353 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
2010      .6875 1 . 1 0 1 1 0 .         1 1 1  .01936817 1 0  .8552898 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
2011   .5555556 1 . 1 0 1 1 0 .         1 1 1  .00677681 1 0  .8495679 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
2012   .4782609 1 . . 0 1 1 0 .         1 1 1 .015763283 1 0   .834486 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
2013   .4117647 1 . . 0 1 1 0 .         1 1 1 .012332916 1 0  .8337547 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
2014          0 1 . . 0 1 1 0 .         1 1 1 .012856483 1 0   .838679 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
2015          0 1 . . 0 1 1 0 .         1 1 1 .031881332 1 0   .850991 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
2016          . 1 . . 0 1 1 0 .         1 1 1 .037225723 1 0  .8440136 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
2000          . 2 0 1 0 1 1 0 .         1 1 1   .3551378 1 0 1.8011683 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2001          0 2 0 1 0 1 1 0 .         1 1 1  .34648895 1 0 1.8102077 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2002          0 2 0 1 0 1 1 0 .         2 1 1   .3478985 1 0 1.6868168 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2003          0 2 0 1 0 1 1 0 .         2 1 1   .4008026 1 0 1.5599297 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2004          0 2 0 1 0 1 1 0 .       1.5 1 1   .4411621 1 0  1.437317 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
2005          0 2 0 1 0 1 1 0 .       1.5 1 1   .4520903 1 0 1.2836504 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
2006         .3 2 0 1 0 1 1 0 .       1.5 1 1   .4636974 1 0  2.826926 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
2007          0 2 0 1 0 1 1 0 .       2.5 1 1   .4766197 1 0  .9335653 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
2008   .3333333 2 0 1 0 1 1 0 .         2 1 1   .5061283 1 0  .9192817 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
2009          1 2 0 1 0 1 1 0 .         2 1 1   .5294323 1 0  .8979353 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
2010       .125 2 . 1 0 1 1 0 .       1.5 1 1  .54753304 1 0  .8552898 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
2011          0 2 . 1 0 1 1 0 .         2 1 1   .5656557 1 0  .8495679 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
2012  .06666666 2 . . 0 1 1 0 .         2 1 1   .5645571 1 0   .834486 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
2013   .4615385 2 . . 0 1 1 0 .         2 1 1   .5920744 1 0  .8337547 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
2014  .14285713 2 . . 0 1 1 0 .         2 1 1    .624382 1 0   .838679 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
2015          0 2 . . 0 1 1 0 .       2.5 1 1   .6467857 1 0   .850991 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
2016         .2 2 . . 0 1 1 0 .         2 1 1   .6825209 1 0  .8440136 0 0 0 0 0 0 0 0 0 0 0 0
The regression that I used in Stata is
Code:
xtgee y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 tobs17, family(binomial 1) link(probit) corr(exchangeable) vce(robust)
I'm not sure whether I made the time averages correctly, or whether I controlled for the number of time periods correctly. In other words, I'm not sure whether I accounted for the small-T, large-N setting correctly in the GEE.

Thanks in advance for any help.

Using georoute with zipcodes

Hi all,

I'm trying to generate driving distances between a variable with zip codes only and a variable with 5 different addresses. It looks like traveltime3 is no longer supported, so I'm trying to see if there's a way to use georoute to accomplish this. I have a HERE API ID and code, and entered the following (I changed the coordinate numbers but kept the minus sign):

georoute, hereid(_mine_) herecode(mine) startxy(40.21,-95.75) endxy(40.321, -94.232)

And I get the following error:
"option -95.75 not allowed"
r(198)

Does anyone know why I'm getting that error? I tried searching for the error code and command but can't find anything useful.
Thanks in advance.

Interpretation of standardized coefficients

I know that we have previously discussed how the use of standardized coefficients may obfuscate results rather than shedding more light. I completely agree!

However, I am curious to really understand how one can use these coefficients, if one had to. Suppose we run an OLS on standardized y and x, which yields the standardized coefficient beta.

std(y) = 0.191 std(x)

One may interpret this as: a 1 SD change in X is associated with a 0.191 SD change in Y. Suppose the standard deviations of the unstandardized variables are sd(Y) = 27.37 and sd(X) = 0.094. In this case, would it be reasonable to infer that a 1 SD change in X (sd = 0.094) is associated with a 1.8% change in Y (0.094*0.191)?
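For reference, a minimal sketch of the usual conversion back to the original units: if beta_std is the standardized coefficient, the unstandardized slope is b = beta_std * sd(Y)/sd(X), so a one-SD change in X shifts Y by beta_std * sd(Y) units of Y (numbers taken from the post):
Code:
display 0.191 * 27.37          // change in Y, in Y's own units, per one-SD change in X
display 0.191 * 27.37 / 0.094  // implied unstandardized slope b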

Thanks in advance!

Testing for uniform distribution

This thread suggests that the quantile command can be used to test whether a sample follows the uniform distribution.

I have a sample of N numbers produced by a certain random number generator. The chart built by quantile looks very good (it follows the 45-degree line closely).

But, as I understand it, this is not yet sufficient to conclude that the generator is good. Here is an example:

1,9,2,5,8,6,4,7,3 -> good (for me)
1,2,3,4,5,6,7,8,9 -> not good (for me)


In the second case the quantile plot produces the same picture, but the sequence is "obviously" not random.

The hard part is to formulate what exactly is desirable, but I guess that if I have a sample of N, then quantile should produce "good" pictures for the first N/2 and the second N/2 numbers, while preserving the same min, max, mean, and median (and we repeat recursively). This must be a known standard property; please let me know what term applies here.

This article discusses a lot of approaches to the problem, but has a lot of broken links.

If it matters, I actually have a limited set of K samples with N1, N2, ..., NK elements coming from the same generator. Sample sizes are in the thousands. There is no reason to believe that the generator in question has a finite period; if it does, it is extremely large, much larger than any of the sample sizes I have available to test.

In the end I'd like to get some idea of, for example, whether the generator in question is better or worse than Stata's own, and if worse, how much worse (if such metrics exist).
This is more of a quick diagnostic check than a research project.

Example data N=1000:
Code:
insheet using "http://www.radyakin.org/statalist/2019/rnd1000.txt"
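For what it is worth, a hedged sketch of two quick follow-up checks on those data, assuming the values load into a variable v1 and lie in [0,1]: a one-sample Kolmogorov-Smirnov test against the uniform CDF, and a runs test for random ordering (which would flag the sorted 1,2,...,9 case even though its quantile plot looks fine).
Code:
ksmirnov v1 = v1   // uniform(0,1) CDF evaluated at v1 is v1 itself
runtest v1         // tests whether the sequence order looks random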
Thank you, Sergiy Radyakin

Create vars with conditions inside a loop

Hi,

I have the following data:
Id n_lo1 gr1 gr2 gr3 gr4 gr5 gr6 pri
01 0600 15 30 54 78 96 45 A1
02 0700 20 0 12 41 52 25 A1
03 0800 30 40 12 14 23 63 A1
04 0900 45 12 10 12 45 45 A2
05 1000 2 52 8 78 12 67 A2
06 1100 14 20 69 11 14 21 A2
07 1200 78 5 74 78 78 58 A2
08 1300 5 2 90 32 98 47 A3
and I want to generate variables with many combinations, but for the example I will just write these two: 1) CDT_0600_F_1 2) CDT_0600_F_2
Id n_lo1 gr1 gr2 gr3 gr4 gr5 gr6 pri CDT_0600_F_1 CDT_0600_F_2 .......
01 0600 15 30 54 78 96 45 A1 15 30
02 0700 20 0 12 41 52 25 A1
03 0800 30 40 12 14 23 63 A1 30 40
04 0900 45 12 10 12 45 45 A2
05 1000 2 52 8 78 12 67 A2
06 1100 14 20 69 11 14 21 A2
07 1200 78 5 74 78 78 58 A2
08 1300 5 2 90 32 98 47 A3
For that reason, I created the following syntax:

Code:
levelsof n_lo1, local(levels)
foreach m of local levels {
    foreach j of numlist 1/6 {
        gen CDT_`m'_F_`j' = gr`j' if n_lo1=="`m'" & pri=="A1"
        gen CDT_`m'_F_`j' = gr`j' if n_lo1=="`m'" & pri=="A2" & `j'<=4
        gen CDT_`m'_F_`j' = gr`j' if n_lo1=="`m'" & pri=="A3" & `j'<=2
    }
}
But it doesn't work with the conditions ("pri=="A2" & `j'<=2"), (pri=="A3" & `j'<=2), or (pri=="A1"),

because it still creates variables for all combinations in spite of the conditions. For example, I just want the following variables when pri=="A3":

CDT_1300_F_1
CDT_1300_F_2

And it creates

CDT_1300_F_1
CDT_1300_F_2
CDT_1300_F_3
CDT_1300_F_4
CDT_1300_F_5
CDT_1300_F_6
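A minimal sketch of one possible workaround, assuming n_lo1 and pri are string variables: test whether any observation satisfies the condition before creating the variable, so that unwanted combinations are never generated.
Code:
levelsof n_lo1, local(levels)
foreach m of local levels {
    forvalues j = 1/6 {
        quietly count if n_lo1=="`m'" & (pri=="A1" | (pri=="A2" & `j'<=4) | (pri=="A3" & `j'<=2))
        if r(N) > 0 {
            gen CDT_`m'_F_`j' = gr`j' if n_lo1=="`m'"
        }
    }
}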


I hope you can help me.

Kind Regards,
S.

Omitted variables due to collinearity in PPML regressions

Dear Statalist,

I am trying to run gravity regressions on the impact of FTAs for different sectors and am having problems with the tariff variables. I tried to estimate the regression using the standard command for PPML:

ppml_panel_sg tradeflow FTA n_MFN, ex(exporter) im(importer) y(year) cluster(pair_id)

My problem is that Stata drops the tariff data with the note:

note: ln_MFN omitted because of collinearity over lhs>0 (creates possible existence issue)

I would be really thankful if someone could help me with this issue.

Thanks a lot.

Why not significant?

Dear All, I was asked why the estimate below (shaded), with a t-value of -2.04, is not statistically significant. Any comments? Thanks.

[regression output attached as image]

Problem with small number of clusters using reghdfe and vce suboptions

Hi,

I am using the command reghdfe for two-way clustering at the state and time levels. I have two types of models: one with fixed effects (both state and time fixed effects) and one without fixed effects. Both models need two-way clustered standard errors. The problem I am facing is that the number of clusters in my models is small: 7 for states and 36 for the time period. So reghdfe by itself is not effective in this case.

I came across a post that mentioned the use of suite(mwc) or suite(avar) within vce(cluster). However, when I try to use either of these, Stata shows the error "VCE options not supported", and I'm not sure why this is the case or how it can be fixed.

Also, as mentioned, one of my models does not include fixed effects. What is the correct way of using reghdfe with two-way clustering when fixed effects are not included in the model?
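For what it is worth, a hedged sketch of the plain two-way-cluster syntax (hypothetical variable names; noabsorb is available in recent versions of reghdfe for models with nothing to absorb):
Code:
reghdfe y x1 x2, absorb(state time) vce(cluster state time)   // state and time fixed effects
reghdfe y x1 x2, noabsorb vce(cluster state time)             // no fixed effects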

Thank you!

How to use iv*iv in regression

Hello everyone:

Is there any way that I can run a regression like: reg dv iv1 iv2 iv3*iv4
where iv3*iv4 means the interaction of iv3 and iv4? I already have iv3 and iv4 available; is there any easy way other than generating a new variable iv5 = iv3*iv4?
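A minimal sketch using factor-variable notation, which avoids generating a new variable (c. marks continuous variables; use i. for categorical ones):
Code:
regress dv iv1 iv2 c.iv3#c.iv4     // interaction only
regress dv iv1 iv2 c.iv3##c.iv4    // main effects of iv3 and iv4 plus their interaction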

Thanks,

Merging Excel Files to Create a Stata Dataset

Hello all,
New here, new to statistics and programming, and brand new to Stata, so bear with me.
I am trying to merge three separate Excel files (Location: "C:\Stata") into a single Stata dataset ('MERGED') for analyses.
File names:
AE.xls
FZ.xls
WEIGHTS.xls
Unique Identifier (in all three): ID

What would be the code to go about creating this new data set? Thank you so much in advance.
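A minimal sketch of one way to do it, assuming each file has variable names in its first row and that ID uniquely identifies rows in every file:
Code:
cd "C:\Stata"
import excel using "AE.xls", firstrow clear
save "AE.dta", replace
import excel using "FZ.xls", firstrow clear
save "FZ.dta", replace
import excel using "WEIGHTS.xls", firstrow clear
merge 1:1 ID using "AE.dta", nogenerate
merge 1:1 ID using "FZ.dta", nogenerate
save "MERGED.dta", replace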

probing interactions in quadratic multiple regression

I am trying to probe interactions between my predictor (X) and moderator (Z) in Stata, where X has a quadratic relationship with the dependent variable.
I have used the margins command to test the slopes at different levels of Z (the mean, 1 SD above the mean, and 1 SD below the mean), but I am not sure if I am on the right track, as my results seem off:

margins, dydx(X) at(Z=(-0.4978,-0.0680,0.3618))
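A hedged sketch of how this is often set up so that margins accounts for every term involving X (hypothetical dependent variable Y; the quadratic and its interactions with Z entered via factor-variable notation):
Code:
regress Y c.X c.X#c.X c.Z c.X#c.Z c.X#c.X#c.Z
margins, dydx(X) at(Z = (-0.4978 -0.0680 0.3618))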

Help regarding the stata coding

I need your help regarding the code for a multilevel model that I have not been able to run for a couple of days. The outcome variable is height, whose variation I want to examine at the individual, household, district, and state levels. The code that I am running fails with the error "error obtaining the robust variances".
I am attaching the code which I am using for your consideration.
I shall be obliged if you can please help me out with this.

Code:
set maxiter 100
xtmixed ht wt v212 i.v149 WEIGHT_FOR_AGE [pweight = v005] || new_code: GROWTH GROWTH_RATE || sdistri: dist_lit dist_urban || v002: i.v190 i.v025, nostderr

repeated time values within panel

Dear All,

I have a panel data set. I created both a PANEL ID and a Time ID. As required, the time IDs are all integers. However, whenever I run tsset, it gives the error "repeated time values within panel".

I checked for duplicates using duplicates report and it reported 0 surplus observations, indicating all observations are unique.
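For reference, a minimal sketch of checking duplicates on the two identifiers specifically, which is what tsset requires to be unique (hypothetical names panel_id and time_id):
Code:
duplicates report panel_id time_id
duplicates tag panel_id time_id, generate(dup)
list panel_id time_id if dup > 0, sepby(panel_id)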

Can you let me know how to fix this, and/or what mistake I am making? Thanks in advance,

~Subaran.

Panel Data: Omitted Variables due to Collinearity

Hello everyone, referring to my current study mentioned here: https://www.statalist.org/forums/for...c-price-method

Currently, I have a total of 5 housing variables, 14 distance variables (12 of which are distances to parks), and 12 park-size variables. The data cover 6 years and there are 645 observations. My initial plan was to look only at the distance to the closest park and its respective size (not to include all 12 parks). However, I noticed that those two specific variables are not significant, which explains why I decided to include all the park variables in the analysis.

The plan was to run three separate models (Model 1 consists of housing variables; Model 2 consists of housing and distance variables; Model 3 consists of housing and park-size variables).

I did not experience any problems with the first two models, for which I ran pooled OLS, xtreg, re and xtreg, fe. However, coming to the third model, the minute I ran regress for the housing and size variables, the output informed me that all of the size variables were omitted due to collinearity. Likewise for re. I understand why this happens for fe, as the size variables are time invariant, with some of them being the same throughout the years (correct me if my reasoning is wrong).

I have also done the necessary tests to identify which of the three models is more advisable to use.

I don't plan on using xtreg, be as the OLS and random-effects models are more efficient (?).

Would appreciate if anyone could provide comments or advice on the matter.

Loading Data from ODBC Data source in a loop

Dear Stata Forum,


I encountered this problem yesterday and could not figure out how to solve it. I have to import a bunch of files from an online platform and save them in a folder. Basically, I have one file for each trading day. I tried looping with a local over dates, but it does not work (it works if I put the date into the code myself, but that is not feasible for more than 300 files). I tried something like this:


local dates 131218 311218

foreach x in local dates {

odbc load underlying_basket_isin="underlying_basket_isin" underlying_basket_aii="underlying_basket_aii" underlying_index_name="underlying_index_name" ///
notional_currency1="notional_currency1" trade_id="trade_id" execution_venue="execution_venue" compressed="compressed" price_notation="price_notation" ///
price_rate_eur="price_rate_eur" price_rate="price_rate" notional="notional" notional_eur="notional_eur" quantity_type="quantity_type" quantity="quantity" ///
up_front_payment="up_front_payment" execution_timestamp="execution_timestamp" effective_date="effective_date" maturity_date="maturity_date" ///
termination_date="termination_date" settlement_date="settlement_date" ccp_id_type="ccp_id_type" ccp_id="ccp_id" ///
clearing_timestamp="clearing_timestamp" intragroup="intragroup" underlying_maturity_date="underlying_maturity_date " ///
payment_freq_dq="payment_freq_dq" reference_period="reference_period", table(lab_prj_'x`_Positions) dsn("DISC DP Impala 64bit") clear noquote

cd "chosen directory"
save `x'_pos.dta,replace
}

but this does not work because Stata fails to recognize the x in the loop. Any sort of help is welcome! Thank you very much.
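For comparison, a minimal sketch of the loop skeleton with two common fixes: foreach x of local dates (rather than in local), and the macro written as `x' (backtick before, apostrophe after) inside table(). Only two of the columns are kept here for brevity; the DSN and table names are taken from the post.
Code:
local dates 131218 311218
cd "chosen directory"
foreach x of local dates {
    odbc load trade_id="trade_id" notional="notional", ///
        table(lab_prj_`x'_Positions) dsn("DISC DP Impala 64bit") clear noquote
    save "`x'_pos.dta", replace
}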

Filippo