Channel: Statalist

Can I convert variable frequency to a new variable that represents frequency range?

I am using Stata 14.2

I have a categorical variable district_name and another categorical variable store_name within each district. I encoded the district (encode dist_name, gen(dist_n)) and tabulated it to get the number of stores in each district. Now my new variable looks like this:

Dist_n   Freq (no. of store names)
DistA 10
DistB 1200
DistC 450
DistD 80
DistE 690

Is it possible for me to generate a new variable that represents ranges of the frequency of stores in districts? For example:

District_size N
0 to 99 stores 2
100 to 999 stores 2
1000+ stores 1
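One way to do this - a minimal sketch, assuming each observation in the dataset is one store, so that _N within a district gives the store count - is:

Code:
* count stores per district, then cut the counts into labelled ranges
bysort dist_n: gen nstores = _N
gen district_size = 1 if nstores <= 99
replace district_size = 2 if inrange(nstores, 100, 999)
replace district_size = 3 if nstores >= 1000
label define dsize 1 "0 to 99 stores" 2 "100 to 999 stores" 3 "1000+ stores"
label values district_size dsize
tab district_size
An alternative is egen district_size = cut(nstores), at(0 100 1000 100000), which does the binning in one step (the top bound is an arbitrary assumed maximum).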

regexs() in Mata

Dear all,

I was wondering whether the issue with regexs() in Mata mentioned here has already been solved. In my case the problem is the following. Assume I have a matrix X from which I need to get matrix Y using regular expressions (the problem is simplified for expository purposes).

Code:
X = "H 2000 A" \  "H 2001 A" \ "H 2002 A" \  "H 2003 A"
Y = 2000 \ 2001 \ 2002 \ 2003
However, if I do the following, I only get the value for the last row:
Code:
regexm(X, "([0-9]+)")
Y = regexs(1)
Y
That is not what I need. I then tried to solve the problem using regexr(). However, notice that the second time it is executed, it does not work as expected.

Code:
Y=X
Y = regexr(Y, "([A-Z]+)", "")
Y = regexr(Y, "([A-Z]+)", "")  // this does not remove the "A"; adding a leading space works -- regexr(Y, " ([A-Z]+)", "") -- but it should not be necessary
Y
So, does anyone know how to use regexs() correctly in Mata?
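A common workaround (a minimal sketch, not an official fix) is to loop over the rows and call regexm()/regexs() on one string scalar at a time, since regexs() only retains the subexpressions of the most recent match:

Code:
mata:
X = "H 2000 A" \ "H 2001 A" \ "H 2002 A" \ "H 2003 A"
Y = J(rows(X), 1, .)                 // numeric result vector
for (i = 1; i <= rows(X); i++) {
    // regexm() on a scalar, so regexs(1) refers to this row's match
    if (regexm(X[i], "([0-9]+)")) Y[i] = strtoreal(regexs(1))
}
Y
end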

Thank you so much,

Pablo

fixed effects: correcting autocorrelation and heteroskedasticity in panel data

Dear members,

My question is a follow-up to this thread: http://www.stata.com/statalist/archi.../msg00127.html

Jeff Wooldridge suggested that the clustering option is only valid for small T and large N. I would like to ask which command should be used instead when T is larger than N, and also in cases where T and N are both large.

Many thanks
Jay

Panel data dummy variable omitted

Respected members,

I am new to Stata and facing certain issues; I would be highly obliged if someone would help. I am working on a strongly balanced panel of 85 companies over 14 years (2001-2014). I have foreign ownership as an independent variable, coded 0 for foreign and 1 for domestic. The Hausman test favours fixed effects, but running fixed effects omits this variable because of collinearity. I cannot drop this variable since it is very important. Can I use random effects?
Moreover, there are heteroskedasticity and autocorrelation in the model. Can I add a cluster option at the end of the random-effects model to correct for both, or is there some other command?
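For reference, the kind of specification described would look like the sketch below (all variable names are hypothetical); note that vce(cluster panelvar) makes the standard errors robust to heteroskedasticity and within-panel serial correlation rather than removing them from the model:

Code:
* hedged sketch of a cluster-robust random-effects model; names are hypothetical
xtset company_id year
xtreg performance i.foreign_own size leverage, re vce(cluster company_id)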

Regards.

ologit brant test

I am using brant, detail after ologit to test the proportional-odds assumption. For each regressor, there are two lines of numerical values in the output of this test. The first line contains the regression coefficients (one for each of the C-1 cuts of the ordinal outcome). What is contained on the second line? Thanks.

seriously unsure what to use for heteroskedastic panel data

Hi all

For the past week or two I've been confused about which regressions to run for my data. I'm running a model to compare the effect of foreign aid on developing countries. My dependent variable is GDP per capita growth; my independent variables are initial GDP per capita, aid/GDP, trade openness (% of GDP), foreign direct investment, population growth, and violence. Aid/GDP is the main variable, with the rest being controls.

Initial GDP per capita is the level at the start of each 5-year period; every 5 years it switches, so 1970-1974 carries the 1970 GDP, 1975-1979 carries 1975, and so on. I wanted to apply the 1970 value throughout, but that raised a multicollinearity issue. Every piece of literature that includes this variable finds it significant and negative, because of the idea of economic convergence, so it's important that I try to reach the same.

The number of countries is 40, with the period being 44 years (1970-2014).

-----------------------------------------

A Hausman test indicated that the fixed-effects model is preferred. Running xtreg, fe brings up some nice results akin to much of the literature (all significant bar violence; aid/GDP negative and initial GDP per capita negative). However, running xttest3 suggests that the errors are heteroskedastic, and hence I must consider other models.

My first problem is this: from my research and notes I've been pointed to the following: xtpcse, xtreg, fe robust, xtgls, and xtscc, and I'm not sure which to use. Through xtserial, correlations among lagged variables, and actest, no autocorrelation was found.
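(For reference, those two diagnostics are typically run as in this sketch, using the x/y shorthand of the post:)

Code:
xtreg x y, fe
xttest3         // modified Wald test for groupwise heteroskedasticity, after xtreg, fe
xtserial x y    // Wooldridge test for serial correlation in panel data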

I don't know about xtscc. From what I've read on the forums it needs some level of autocorrelation, with high N but low T; mine are medium and roughly equivalent. At the same time, running xtcsd, pesaran abs (which, according to my university notes, tests for contemporaneous correlation) yields p = 0.0000, suggesting that Driscoll-Kraay standard errors should be used in this instance. Unfortunately, they make all my variables insignificant bar aid/GDP, which yields P>|t| of 0.042.

Code:
xtgls x y, panels(heteroskedastic)
xtpcse x y
I've seen that both are regarded as good. They both broadly agree with the original xtreg, fe results.

Code:
xtreg x y, fe robust
unfortunately makes my initial GDP insignificant, although aid/GDP is still significant alongside FDI and trade.

---------------------------------------------

My second problem is this: from my university notes, running testparm on both i.year and i.countryid suggests that country and time fixed effects are needed (Prob > F = 0.0000). Now, running

Code:
(1) xtreg x y i.year, fe robust
(2) xtreg x y i.countryid, fe robust
(1) makes all my variables insignificant bar aid at 1%; (2) omits all my country IDs but shows FDI significant at the 0.01 level.

Applying both year and countryID in one command makes my initial gdp per capita negatively significant but aid itself insignificant, which is quite confusing.

Code:
(1) xtgls x y i.year, panels(heteroskedastic)
(2) xtgls x y i.countryid, panels(heteroskedastic)
Both of them support the previous original xtreg, fe and the original xtpcse/xtgls results (where aid and initial GDP are negatively significant).

Combining them shows that aid is insignificant, yet again.

Code:
xtpcse x y i.country
xtpcse x y i.year
Both show negative initial GDP, although with i.year aid/GDP is significant at 0.1 whereas with i.country it is significant at 0.05.

Combining the two leaves aid insignificant but initial GDP negatively significant.


So which models would be most appropriate to use? Do I need to control for time fixed effects? I'm just so confused and really desperate for a response.
Thanks a lot

merging data in Stata 14

I am trying to merge two datasets. The master contains 50 million observations and 50 variables, with several observations occurring on each day; the using dataset contains 20 variables and 10 million observations. I tried merging using 1:1, 1:m, and m:1, and I receive the following error messages:

variable ISIN not found in using data - yet the variable ISIN is in the using dataset;
variable ISIN does not uniquely identify observations in the master data - this variable is in the master dataset.

I would be grateful if someone could help demystify these error messages and suggest how best to solve this problem.
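For reference, a sketch of the checks that usually resolve these messages (file names and the date key are hypothetical): confirm the exact spelling and case of ISIN in the using file (Stata variable names are case sensitive), and since ISIN repeats across days in the master, include the date in the merge key so that it uniquely identifies observations on at least one side:

Code:
* hedged sketch; file names and the date variable are hypothetical
use master_file, clear
describe using using_file              // check how ISIN is actually spelled there
merge m:1 ISIN date using using_file   // m:1 if ISIN-date is unique in the using data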
thank you in advance
Liz

How to split string variable names

So I have a string variable like the one below:

var1
percent gdp: 1258
current account: 1216
exchange rate: 740

I want to split them so I can have the text and the number separate, as in:

var1 var2
percent gdp 1258
current account 1216
exchange rate 740

Any help please?
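A minimal sketch, assuming the text and the number are always separated by a colon as shown:

Code:
* split on the colon; keep the text in var1 and the number in var2
split var1, parse(":") generate(part)
gen var2 = real(part2)    // real() ignores the leading blank after the colon
replace var1 = part1
drop part1 part2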

Converting data rows to columns

I copied my data from an Excel file. The variables (C, D, E, F, etc.) are years: for some reason they are shown as C, D, E, etc., but they represent 1990, 1991, 1992, and so on. I want to transform my data into a format in which, for example, Austria is listed in a separate row for each observation of Austria, like:

Aut Austria C 76.05
Aut Austria D 76.62
Aut Austria E 80.51

However, after reviewing some other topics, I am still unable to do so. Any help is much appreciated.


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 CountryCode str15 countryname str5(C D E F G H I J K)
"AUT" "Austria"         "76.05" "76.62" "80.51" "80.68" "81.73" "82.96"    "84.79"    "86.19"    "87.53"
"BEL" "Belgium"         "81.11" "80.87" "83.37" "85.59" "87.05" "88.08"    "88.64"    "89.64"    "90.82"
"CHE" "Switzerland"     "80.51" "80.52" "83.67" "84.17" "85.64" "86.23"    "86.37"    "87.23"    "88.74"
"CZE" "Czech Republic"  "."     "."     "."     "."     "64.56" "66.94"    "69.84"    "71.0"    "73.71"
"DEU" "Germany"         "63.69" "63.48" "72.02" "72.97" "74.6"  "75.57"    "76.88"    "78.12"    "80.31"
"DNK" "Denmark"         "75.9"  "76.62" "80.01" "81.21" "82.68" "83.97"    "84.69"    "84.9"    "86.14"
"ESP" "Spain"           "66.42" "67.71" "70.52" "72.51" "74.93" "76.26"    "77.3"    "78.06"    "79.44"
"EST" "Estonia"         "."     "."     "41.46" "43.1"  "46.11" "49.59"    "61.52"    "63.71"    "68.64"
"FIN" "Finland"         "63.72" "64.59" "69.17" "71.02" "73.32" "74.74"    "76.15"    "82.18"    "83.85"
"FRA" "France"          "74.06" "74.92" "78.25" "79.7"  "80.86" "80.53"    "81.72"    "81.96"    "83.45"
"GBR" "United Kingdom"  "80.46" "80.28" "82.28" "82.33" "84.02" "84.62"    "84.87"    "85.7"    "86.64"
"GRC" "Greece"          "48.64" "49.79" "62.04" "63.01" "66.72" "67.58"    "68.18"    "69.63"    "70.87"
"HUN" "Hungary"         "53.51" "58.99" "62.18" "65.91" "69.11" "72.34"    "74.98"    "76.99"    "79.13"
"IRL" "Ireland"         "72.29" "72.65" "74.76" "76.34" "78.02" "79.61"    "80.84"    "82.15"    "83.2" 
"ISL" "Iceland"         "53.08" "53.24" "55.02" "56.56" "63.72" "65.62"    "66.9"    "67.69"    "73.57"
"ITA" "Italy"           "65.35" "66.18" "68.93" "70.74" "73.09" "73.51"    "74.21"    "75.4"    "77.13"
"LUX" "Luxembourg"      "71.83" "72.0"  "74.54" "81.66" "82.4"  "76.83"    "77.16"    "77.96"    "78.56"
"LVA" "Latvia"          "."     "37.64" "38.6"  "39.3"  "40.66" "46.49"    "48.88"    "51.78"    "55.02"
"NLD" "Netherlands"     "81.57" "81.72" "83.74" "86.07" "86.49" "86.83"    "87.18"    "87.3"    "88.11"
"NOR" "Norway"          "75.17" "75.53" "77.61" "77.68" "78.52" "80.02"    "80.43"    "81.98"    "82.63"
"POL" "Poland"          "47.34" "47.53" "53.13" "59.0"  "62.34" "63.0"    "65.26"    "66.28"    "68.47"
"PRT" "Portugal"        "53.19" "53.5"  "61.25" "67.38" "71.0"  "72.42"    "73.74"    "75.6"    "77.13"
"SRB" "Serbia"          "39.43" "40.38" "45.38" "45.75" "45.69" "41.55"    "41.72"    "42.28"    "42.51"
"SVK" "Slovak Republic" "."     "."     "."     "."     "54.74" "56.49"    "61.55"    "63.89"    "66.13"
"SVN" "Slovenia"        "."     "."     "40.02" "43.5"  "47.59" "51.08"    "52.85"    "58.04"    "64.16"
"SWE" "Sweden"          "78.21" "78.86" "81.31" "81.06" "83.01" "84.32"    "85.0"    "85.66"    "87.45"
"TUR" "Turkey"          "47.1"  "44.73" "48.29" "51.51" "53.88" "59.11"    "59.93"    "60.64"    "62.54"
"USA" "United States"   "73.19" "72.6"  "74.4"  "74.57" "76.24" "76.6"    "78.13"    "78.59"    "79.19"
"CAN" "Canada"          "78.86" "78.83" "80.45" "81.52" "82.75" "83.85"    "84.98"    "85.42"    "86.69"
"AUS" "Australia"       "73.35" "73.99" "74.51" "76.17" "77.15" "78.64"    "78.98"    "78.36"    "79.13"
"JPN" "Japan"           "48.95" "49.57" "51.54" "56.84" "57.64" "57.69"    "53.75"    "58.21"    "59.32"
"KOR" "Korea"           "41.65" "41.09" "41.89" "43.7"  "50.11" "51.15"    "52.12"    "54.64"    "56.06"
end
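With the -dataex- example above, one route is to destring the year columns, rename them to carry their years, and reshape long (a sketch: the post confirms only that C, D, E are 1990-1992, so mapping F-K to 1993-1998 is an assumption):

Code:
* hedged sketch: C..K assumed to be 1990..1998
destring C-K, replace
rename (C D E F G H I J K) (v1990 v1991 v1992 v1993 v1994 v1995 v1996 v1997 v1998)
reshape long v, i(CountryCode countryname) j(year)
rename v value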

highlighting multiple points on funnel plot

Hello

I am trying to figure out how to highlight multiple points on a funnel plot, using the name of the site only, if possible.

I can get one to work using markunit, but I haven't figured out how to have multiple sites marked

Does anyone know the easiest way or if it is even possible?

Thank you

Julie

scientific notation in Stata

Hello, I am trying to transfer an Excel datasheet to Stata v11, and one of the variables (date of diagnosis) appears in Stata in exponential form. Where do I go to change this? Thank you for your help.
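If the underlying values are intact and only the display is exponential, changing the display format is usually enough (a sketch; the variable name is hypothetical). If the variable holds Excel date serial numbers, it will additionally need converting to a Stata date before a %td format is applied:

Code:
* widen the display format so all digits show (variable name hypothetical)
format date_of_diagnosis %15.0g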

Copy values to the group

Hi There,

I am new to Stata and this is my first post here. Excited to be here.
I have a question on copying a value to missing values of the group.
Below is the illustrative problem I have. How do I copy the value for all of the fields of a group. e.g. I want to copy 1 for all of value cells of group 1 and so on for the rest of the groups.
Can anyone help me here with STATA code?

Thanks
Karthik
ID Year Value
1 2004 1
1 2005 .
1 2006 .
1 2007 1
1 2008 1
1 2009 1
1 2010 .
1 2011 .
2 2004 0
2 2005 0
2 2006 0
2 2007 0
2 2008 .
2 2009 .
3 2004 1
3 2005 1
3 2006 .
3 2007 .
3 2008 .
3 2009 .
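Since each ID's first year is nonmissing in the example, carrying the last nonmissing value forward within each ID does what is asked (a minimal sketch):

Code:
* carry the most recent nonmissing Value forward within each ID
bysort ID (Year): replace Value = Value[_n-1] if missing(Value)
Because replace works observation by observation in sort order, filled values cascade down the group.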

Assessing Impact of Policy Change Using Panel Data

Hi - I am new to Stata and am having trouble with a project aimed at identifying the impact on states of implementing new legislation. I have yearly data on 15 states from 2006-2015. The legislation I am interested in is at the state level, and as such the states implement the policies at different times. For the 2 policy changes of interest, I have 2 dummy variables that are equal to 1 for each state in each year that the state has the policy in effect, and 0 otherwise. I am interested in measuring the impact of the policy changes on my outcome variable, both in aggregate (the average impact in any given state for a policy change) and on a state-by-state basis.
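A common starting point for this setup is a two-way fixed-effects regression with the policy dummies, clustering by state (a hedged sketch; variable names are hypothetical). Interacting each policy dummy with the state indicators would give the state-by-state effects:

Code:
* hedged sketch: average effect of each policy, with state and year fixed effects
xtset state year
xtreg outcome policy1 policy2 i.year, fe vce(cluster state)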

Any help would be much appreciated.

Stata MP and -statsby-

Does anyone know why the statsby command is not parallelized by Stata MP? This seems like the simplest inherently parallel operation in Stata, and yet it appears to be implemented in a completely sequential way.

Hurdle logit-Poisson model: marginal effects equal the coefficient estimate

Dear Statalist,

Having estimated the coefficients in a Hurdle Logit-Poisson model by:

Code:
svy: hplogit y $xvars
and then the marginal effects for both stages:
Code:
margins, dydx(*) predict(eq(logit))
margins, dydx(*) predict(eq(poisson))
produces the same results for the coefficient and the marginal effect. This is also the case in the example in Cameron, A. C., and P. K. Trivedi. 2009. Microeconometrics Using Stata. College Station, TX: Stata Press (pages 572 and 574).

What does this mean? Are the coefficient and the marginal effect the same, or is there a mistake in the commands?

Any help is appreciated!

Implementing IPTW/DR on complex survey data

I am working on a propensity-score-weighted model to estimate the effect of college experience on several outcomes of interest, net of selection into college. The data are the Educational Longitudinal Study of 2002 (ELS:2002). They are stratified and weighted for representation of the 2002 national high-school population, and missing data are handled using multiple imputation:

Code:
mi svyset PSU, weight(F3BYPNLWT) strata(STRAT_ID)
I understand that teffects, pscore, etc. are not compatible with the svy suite, so I have been seeking to replicate the inverse-probability weighting process (ref: http://www.statalist.org/forums/foru...e-psmatch2-etc) and, ideally, the doubly robust estimators (ref: http://www.stata-journal.com/article...article=st0149) for my uses.

I generally followed the code in the two references above, amending to be compatible with mi and svy. Below are the steps I've taken. However, I'm concerned with two anomalies in the output that make me think I may have done something wrong in the process:
(1) The output for the outcome models, weighted both for design effects and IPTW, reports a Population Size of 21.261939; the actual population size for these data, after weighting, is far larger.
(2) The difference between the coefficient returned by the model under IPTW and that returned through the DR process is, in many cases, very large: much larger than I would expect given the materials referenced above.

Here's what I've done. I may well be out of my league - any advice or correctives very welcome!

- Model propensity for treatment variable (F3_BACHELORS):
Code:
mi estimate, saving(hasba): svy: logit F3_BACHELORS <predictors>
- Compute predicted probabilities for treatment variable
Code:
mi predictnl propensity_bachelors = predict(pr) using hasba, storecompleted
- Transform propensity_bachelors into a weight:
Code:
gen invwt_f3_bachelors = F3_BACHELORS / propensity_bachelors + (1 - F3_BACHELORS) / (1 - propensity_bachelors)
egen i_f3_b_total = total(invwt_f3_bachelors)
gen normwt_f3_bachelors = invwt_f3_bachelors / i_f3_b_total
gen iptw_f3_bachelors_weight = normwt_f3_bachelors * F3BYPNLWT
- Reweight by the iptw weight (which is the design probability weight multiplied by the normed weight from the predicted probabilities):
Code:
mi svyset PSU, weight(iptw_f3_bachelors_weight) strata(STRAT_ID)
- Model the outcome of interest (in this case, F3D39):
Code:
mi estimate: svy: logit F3D39 ib0.F3_BACHELORS <predictors>
- Calculate DR estimators for the effect of F3_BACHELORS on F3D39 (code follows http://www.stata-journal.com/article...article=st0149 ):
Code:
mi svyset PSU, weight(F3BYPNLWT) strata(STRAT_ID)
mi estimate, saving(f3d39_1): svy, subpop(F3_BACHELORS): ///
    logit F3D39 ///
        ib5.bypared_reduced ib4.byincome_reduced ib5.BYRACE_AP i.BYSEX ///
        i.F2D11_R     i.F3A14A i.F3A14B i.F3A14C i.F3A14D i.F3A14E i.F3A14F ///
        i.F1OCC30_REDUCED F1RHEN_C F1RHMA_C F1RHSC_C F1RHSO_C F1RHCO_C F1RHFO_C ///
        STEM_COUNT SOCSCI_COUNT HUMARTS_COUNT PROF_COUNT VOC_COUNT  
mi estimate, saving(f3d39_0): svy, subpop(F3_NO_BACHELORS): ///
    logit F3D39 ///
        ib5.bypared_reduced ib4.byincome_reduced ib5.BYRACE_AP i.BYSEX ///
        i.F2D11_R     i.F3A14A i.F3A14B i.F3A14C i.F3A14D i.F3A14E i.F3A14F ///
        i.F1OCC30_REDUCED F1RHEN_C F1RHMA_C F1RHSC_C F1RHSO_C F1RHCO_C F1RHFO_C ///
        STEM_COUNT SOCSCI_COUNT HUMARTS_COUNT PROF_COUNT VOC_COUNT  
mi predictnl f3d39_mu1 = predict(pr) using f3d39_1, storecompleted
mi predictnl f3d39_mu0 = predict(pr) using f3d39_0, storecompleted
generate f3d39_iptw = (2 * F3_BACHELORS - 1) * F3D39 * ///
    propensity_bachelors
generate f3d39_mdiff1 = (-(F3_BACHELORS - propensity_bachelors) * ///
    f3d39_mu1 / propensity_bachelors) - ((F3_BACHELORS - ///
    propensity_bachelors) * f3d39_mu0 / (1 - propensity_bachelors))
display as text "DR estimator for F3D39 = " as result f3d39_iptw + f3d39_mdiff1

Testing for systematic relationships

Dear Statalist,

I am trying to test whether there is a systematic relationship between GDP per capita growth and total hours worked, ICT capital, non-ICT capital, and MFP (using data from the OECD iLibrary), over time and across 20 countries (so using panel data).

My first idea was to run stationarity tests and then test for cointegrating relationships, but I'm not sure whether that is correct. Does anyone else have any suggestions?
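As a sketch of that first step (variable and panel identifiers are hypothetical), panel unit-root tests such as xtunitroot would precede a panel cointegration test, e.g. xtcointtest in newer Stata releases or Westerlund's xtwest from SSC:

Code:
* hedged sketch of the stationarity step; names are hypothetical
xtset country year
xtunitroot ips gdppc_growth, lags(1)
xtunitroot ips hours_worked, lags(1)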

Thank you,
Giulia

Cox proportional hazard model and the PH assumption

Hello everybody,

I am running a Cox regression in order to assess whether individuals return to the labour market earlier after a legislation change. My treatment group consists of individuals in the year after the new law, and the control group of individuals before the law change. The treated variable takes the value 1 if individuals are in the treatment group and 0 otherwise. Now, estimation results show that individuals respond to the law and return earlier, but the treated variable does not survive the proportionality assumption test. (Which in my view is logical, because whether you get into the treatment group depends on the timing of the law - or am I mistaken?)

My question now:

(1) When I interact time and treated into a new variable (interact = treated*_t), does the interpretation of the estimate stay the same? The treated hazard ratio was 1.3, meaning a higher chance of facing the failure event. The interact estimate is 1.02 - does that still mean the failure event is more likely for the treatment group, even though there is a big drop in the effect?

(2) Do I then have to include both variables, treated and interact?
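For what it's worth, the interaction with analysis time described here is usually specified through stcox's tvc() option rather than a hand-built interact variable (a sketch; the control variables are hypothetical). In that specification both terms stay in the model: the main treated coefficient is the effect at _t = 0, and the tvc() coefficient is the change per unit of analysis time:

Code:
* hedged sketch: treated effect allowed to vary linearly with analysis time
stcox treated age education, tvc(treated) texp(_t)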

If I need to clarify anything further, please tell me.

Kind regards
Tim

Loop regression of panel data

Dear Statalists,

I am estimating a panel model of the stock returns of individual firms on both market returns and the risk-free rate. The data are at daily frequency. All I need is to run the regression for each firm in each year and get the standard deviation of the reported residuals, which can be done through a loop in Stata. However, I meet two problems when running the loop:

1. A number of residuals are missing, e.g., "no observations r(2000)". I put a copy of this in the attachment.

2. I have randomly checked some firms, comparing residuals from the loop with residuals from single regressions, and find that they differ.

My loop code is:

Code:
egen group = group(permco year)
gen residual = .
su group, meanonly
forvalues i = 1/`r(max)' {
    capture regress retx sprtrn rf if group == `i'
    if _rc == 0 {                                // skip firm-years with no observations
        predict temp if group == `i', residuals  // predict only within this group
        replace residual = temp if group == `i'
        drop temp
    }
}


where permco is the firm identifier and year is the calendar year. Because I need to run the regression model every year on daily stock return data, I generate the groups at the firm-year level, i.e., egen group = group(permco year).

I have also attached the sampled file as well.

Many thanks.

Best wishes,
Cong

xtunitroot interpretation

Can you please help me interpret the following results? I would say that the series is stationary, but then the inverse normal statistic is not significant.

This is a balanced panel dataset.

Code:
xtunitroot fisher urbanisation, dfuller drift lags(0) demean

Fisher-type unit-root test for urbanisation
Based on augmented Dickey-Fuller tests
-------------------------------------------
Ho: All panels contain unit roots           Number of panels  =     27
Ha: At least one panel is stationary        Number of periods =     21

AR parameter: Panel-specific                Asymptotics: T -> Infinity
Panel means:  Included
Time trend:   Not included                  Cross-sectional means removed
Drift term:   Included                      ADF regressions: 0 lags
------------------------------------------------------------------------------
                                  Statistic      p-value
------------------------------------------------------------------------------
 Inverse chi-squared(54)   P       191.6609       0.0000
 Inverse normal            Z        -1.3179       0.0938
 Inverse logit t(139)      L*       -2.9000       0.0022
 Modified inv. chi-squared Pm       13.2464       0.0000
------------------------------------------------------------------------------
 P statistic requires number of panels to be finite.
 Other statistics are suitable for finite or infinite number of panels.