Problems with merging datasets - unique ID repeats for twins/triplets for some datasets but not others

February 28, 2017, 8:07 am

Hello everyone,

I have run into a problem when attempting to merge datasets in Stata 14. I am using panel data with three cohorts, called the “Child of the new millennium”, which contains data on children born in 2000. I am currently attempting to merge the datasets within each cohort. The problem is that the ID does not uniquely identify each response in some datasets, in particular the datasets asking specific questions about the child. This is because the sample includes twins/triplets, so for about 300-400 observations the same ID appears more than once. A dummy variable is included that allows me to identify twins and triplets, but apart from that there is no way to distinguish them.

But in other datasets, the ID appears only once so I am able to successfully merge in these.

Unfortunately, I need to keep the twins and triplets in my sample. I was wondering if anyone could help? I'd really appreciate any advice.

I am new to both Stata and this forum, so please let me know if any more information is needed!

Thanks in advance, Kishan
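
One way this situation is often handled, sketched under assumptions: when the child-level file repeats the family ID and the other file has one row per ID, a many-to-one merge keeps every twin and triplet. The tiny datasets and the names id, childscore, and hhincome below are made up for illustration and are not from the poster's data.

```stata
* Family-level file: one row per id (made-up data)
clear
input id hhincome
1 300
2 450
3 500
end
tempfile family
save `family'

* Child-level file: id 2 appears twice because of twins (made-up data)
clear
input id childscore
1 10
2 12
2 14
3 9
end

* Show how many IDs repeat in the child-level file
duplicates report id

* Many child rows to one family row; both twins receive the same hhincome
merge m:1 id using `family'
list, sepby(id)
```

If two child-level files both repeat the ID, a many-to-one merge is not enough: a within-family child number that exists in both files would be needed for a 1:1 merge on id plus that number. Generating one with bysort id: gen childnum = _n is only safe if the children appear in the same order in both files, which is an assumption to verify.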

↧

Graph - Rarea made on PC invisible on Mac.

February 28, 2017, 8:15 am

I create an area graph on my Windows machine:

Code:

twoway (rarea np5 np95 dist, sort fcolor(gray) fintensity(inten20) lcolor(white)) ///
    (line zero dist, sort msymbol(none) clcolor(black) clpat(dash) clwidth(thin)) ///
    (line yhat dist, sort msymbol(none) clcolor(black) clpat(solid) clwidth(medium)) ///
    (scatter dfact dist if sig5, sort msymbol(+) msize(medium) mcolor(red)) ///
    (scatter dfact dist if sig10, sort msymbol(O) msize(medium) mcolor(red)) ///
    (scatter dfact dist if (!sig5 & !sig10), sort msymbol(Oh) msize(medium) mcolor(black)), ///
    legend(off) graphregion(color(white)) xtitle("Distance to Factory") ///
    ytitle("Estimated Effect") xlabel(0(1)20) xsc(r(0 20)) ylab(, nogrid) title("")

graph export myPlot.pdf

On my machine, a PNG screenshot of this PDF looks like:
[screenshot not preserved]

However, my collaborator on a Mac cannot see the shaded region; it appears white. Have others faced this problem? How would I fix it?

I've tried other settings for the plot, but with no luck.
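
A possible workaround, offered as a guess rather than a diagnosis: PDF viewers can differ in how they render light fills, so it may be worth setting the shade directly with a gray-scale color such as gs14 instead of combining fcolor(gray) with fintensity(), and exporting a PNG alongside the PDF for comparison. The sketch uses Stata's shipped sp500 dataset, not the poster's data, and the output file names are made up.

```stata
* Shipped example data with high/low series suitable for a range-area plot
sysuse sp500, clear

* Explicit light-gray fill and matching outline instead of fintensity()
twoway rarea high low date in 1/60, fcolor(gs14) lcolor(gs14) graphregion(color(white))

* Export both formats so the collaborator can compare them on the Mac
graph export myPlot_test.pdf, replace
graph export myPlot_test.png, replace width(2000)
```

If the PNG shows the shaded band on the Mac but the PDF does not, that would point to the PDF viewer rather than to the graph itself.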

↧

Display narrow studies with narrow confidence interval in Forest plot (meta-analysis).

February 28, 2017, 8:18 am

I have 28 studies, some of which have a very wide confidence interval (1.2-98), so the lines for the other studies are not displayed properly. I tried an option such as xlabel(0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100), but it did not help. Please guide me.

↧

Crash when using xtabond and/or xtdpdsys

February 28, 2017, 9:33 am

Hey,

I am trying to estimate a dynamic panel model using xtabond (I have also tried xtdpdsys). Unfortunately, Stata (and, a few seconds later, the whole computer) crashes without giving an error message. The task manager shows that Stata's memory use climbs above 10 GB within a few seconds. I ran xtset before running the commands, in case they cause problems without it. Any suggestions as to what could be wrong? Thanks in advance!

Best,
Thomas
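
One thing that may be worth checking, as a guess at the cause: with many time periods, the number of GMM instruments built by xtabond and xtdpdsys grows quickly with T, which can consume a lot of memory. The maxldep() option caps how many lags of the dependent variable are used as instruments. The sketch uses the abdata example dataset from the Stata manuals rather than the poster's data, and the cap of 3 is arbitrary.

```stata
* Example dataset from the Stata manuals
webuse abdata, clear
xtset id year

* Check the number of panels and time periods before estimating
xtdescribe

* Limit the lags of the dependent variable used as instruments
xtabond n w k, lags(1) maxldep(3) vce(robust)
```

The instrument count reported in the xtabond header gives a quick sense of whether the cap is having an effect.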

↧

Measuring distance between control and treatment villages

February 28, 2017, 10:05 am

Hey,
I have a dataset containing the names of the villages, their coordinates (longitude and latitude), and a dummy variable indicating assignment to a program (1 = treatment, 0 = control). I want to calculate the shortest distance from each "control" village to a "treated" one.
I tried to use the geonear command in the following way:

geonear Village Latitude Longitude using "file", n(Village Latitude Longitude) ignoreself

But of course it is calculating the shortest distance to ANY village, not just from a control village to a treated one. I wonder if there is an option that restricts the command above in this way, or if there is another way to reach my goal.

Thank you in advance for your help,

Matteo
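
A minimal sketch of one approach, under assumptions: keep only the control villages in memory as the base and pass a file containing only the treated villages as the neighbor file, so every match is necessarily control-to-treated. The data, the dummy name treated, and the t_ prefixed names are made up for illustration; geonear is user-written (ssc install geonear).

```stata
* Made-up villages with a 0/1 treatment dummy
clear
input str10 Village double(Latitude Longitude) byte treated
"Alpha"   17.25 77.25 1
"Bravo"   17.50 77.25 0
"Charlie" 17.75 77.50 0
"Delta"   17.30 77.60 1
end
gen long vid = _n                      // numeric village ID

* Save the treated villages as the neighbor file, under distinct names
preserve
keep if treated == 1
keep vid Latitude Longitude
rename (vid Latitude Longitude) (t_vid t_lat t_lon)
tempfile treatedfile
save `treatedfile'
restore

* Keep only control villages, then find the nearest treated village for each
keep if treated == 0
geonear vid Latitude Longitude using "`treatedfile'", neighbors(t_vid t_lat t_lon)
list
```

Because no village appears in both files, the ignoreself option should not be needed here.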

↧

diagnostic methods for GEE models

February 28, 2017, 10:32 am

Hi Stata team,
I have a continuous outcome that is measured repeatedly over time, and I wanted to analyze it with a random-effects model, but the outcome is not normally distributed. I used the ladder command to check which transformation would be most suitable, and according to the Q-Q plot, the log of the outcome was closest to normal.
I then used the xtgee command to build the model, with log as the link. However, I am not sure of my model because I don't know how to run post-estimation diagnostics for GEE in Stata.
I was wondering if there is a way to check the residuals from the GEE model. I am not sure whether my question includes all the elements needed for an answer.

Thank you very much.
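
A rough sketch of one way to look at residuals after xtgee, assuming a Gaussian family with a log link: predict the fitted mean and compute raw residuals by hand. Note that a log link models the log of the expected outcome, which is not the same as modeling the mean of the logged outcome. The example uses the nlswork dataset from the Stata manuals, and the covariates are arbitrary placeholders.

```stata
* Example panel data from the Stata manuals
webuse nlswork, clear
xtset idcode year
gen double wage = exp(ln_wage)         // positive continuous outcome

* GEE with log link and exchangeable working correlation
xtgee wage age ttl_exp, family(gaussian) link(log) corr(exchangeable) vce(robust)

* Fitted mean and raw residuals
predict double muhat, mu
gen double rawres = wage - muhat

* Residuals against fitted values, and the estimated working correlation
scatter rawres muhat, yline(0)
estat wcorrelation
```

A fan shape in the residual plot would suggest that a different family, such as gamma, might describe the variance better.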

↧

Probit/Logit with Panel Data. Should I use probit or xtprobit?

February 28, 2017, 11:08 am

Hi everyone,

I am using Stata 14 to work with a panel data set of the United States from 2007 to 2015. I want to estimate a discrete choice model, but I am not sure whether I should use:

probit dep indep ..., vce(cluster stateid)

or:

xtprobit dep indep ..., pa vce(robust)

I am concerned about serial correlation in my data, which is why I am shying away from a logit model with fixed effects, and using vce(bootstrap) doesn't seem to work:

xtlogit dep indep ..., fe vce(bootstrap)

Essentially, my question is which estimation method to use. probit and xtprobit give very different results. Any suggestions would be greatly appreciated!
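
A small sketch that may help locate the source of the difference, under the assumption that it comes from the working correlation: a population-averaged probit with an independent working correlation should reproduce the pooled probit coefficients, while the default exchangeable structure generally will not. The example uses the union dataset from the Stata manuals with arbitrary covariates.

```stata
* Example panel data from the Stata manuals
webuse union, clear
xtset idcode year

* Pooled probit with cluster-robust standard errors
probit union age grade, vce(cluster idcode)

* Population-averaged probit, independent working correlation
xtprobit union age grade, pa corr(independent) vce(robust)

* Population-averaged probit, default exchangeable working correlation
xtprobit union age grade, pa vce(robust)
```

Comparing the three sets of coefficients shows how much of the gap is due to the assumed within-panel correlation rather than to the estimator itself.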

↧

Drop Records Missing Majority of Observations of Variables

February 28, 2017, 11:38 am

I have a data set in which several records are missing observations for the majority of the variables (in Excel, several rows would be blank for the majority of the columns). Is there a command to remove these particular records? I don't want to haphazardly drop records that are missing the occasional observation, just those that are mostly missing (195 missing out of 210 variables).
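
A minimal sketch of one common approach: count the missing values in each row with egen's rowmiss() function and drop rows above a threshold. The four-variable dataset and the threshold of 3 are made up; with the poster's data the threshold would be 195 and the variable list would cover the 210 variables.

```stata
* Made-up data: rows 2 and 4 are mostly missing
clear
input x1 x2 x3 x4
1 2 3 4
. . . 5
. 2 3 4
. . . .
end

* Number of missing values per row across the listed variables
egen nmiss = rowmiss(x1-x4)

* Drop only the rows that are mostly missing
drop if nmiss >= 3
list
```

Running tabulate nmiss before the drop is a cheap way to confirm where the cutoff should sit.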

↧

Error Clustering by Industry or Firm ?

March 1, 2017, 2:49 am

Hi,

I have a large panel data set with more than 20,000 observations.

I want to regress stock returns on textual sentiment in annual reports.

Other research papers include YEAR and INDUSTRY fixed effects (done).

In addition, some also cluster the errors by INDUSTRY or FIRM.

Is it even possible to cluster the errors by INDUSTRY when I have already included INDUSTRY fixed effects?

When I cluster errors by INDUSTRY, Stata says "panels are not nested within clusters".

What is the difference between clustering errors by INDUSTRY and by FIRM, given the fixed effects already included?

Thank you!
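
A sketch of a quick check, assuming the panel variable is the firm: the message "panels are not nested within clusters" typically appears when at least one panel belongs to more than one cluster, for example a firm that changes industry over time. The data and the names firm, year, and industry are made up.

```stata
* Made-up panel: firm 2 changes industry between years
clear
input firm year industry
1 2010 1
1 2011 1
2 2010 1
2 2011 2
3 2010 2
3 2011 2
end

* Flag firms whose industry code is not constant over time
bysort firm (industry): gen byte switches = industry[1] != industry[_N]
list firm year industry switches, sepby(firm)
tabulate switches
```

If some firms do switch, one option is to cluster by firm instead, since each firm is trivially nested within itself; whether that is the right level of clustering is a substantive question the check cannot answer.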

↧

FTA treatment effects using panel data

March 1, 2017, 3:09 am

Hi everyone,
I am studying the impact of the US FTAs (Free Trade Agreements), which were signed individually with 20 countries around the world and have been “phased in” over time up to 2013. I have unbalanced panel data on US exports to 145 countries over the period 1991-2015 (25 years). About 4% of the data are missing (because there was no trade with some countries).
My binary treatment variable FTA is endogenous, since FTAs are not generated at random but result from the countries' self-selection. The US-Israel FTA has been in force since 1985 and is excluded from the sample. I also use other covariates.
The treated countries started treatment (FTA) at different points in time, and Clyde Schechter says in a recent post: "... in the classic DID design, all of the subjects in the treated group begin treatment at the same point in Time, and data are available on both treated and untreated groups both before and after that time." So DID analysis would not be the most appropriate. Right?

If so, is there any way to estimate the impact (ATE or ATT) of the set of FTAs (not individually) on US trade?

On the other hand, based on the work of Bergstrand (2007, 2015) and Anderson (2016), I use a gravity model that takes into account the "multilateral resistance terms" originally suggested by Anderson and van Wincoop (2003):
(1) lnexp_ijt = β₀ + β₁ lngdp_it + β₂ lngdp_jt + β₃ lndist_ijt + β₄ fta_ijt + κ_ij + ξ_it + ζ_jt + ε_ijt

However, in this case there is only one exporting country (the US), so we could rewrite equation (1) as:
(2) lnexp_1jt = β₀ + β₁ lngdp_1t + β₂ lngdp_jt + β₃ lndist_1jt + β₄ fta_1jt + κ_1j + ξ_1t + ζ_jt + ε_1jt

and,
(3) lnexp_jt = β₀ + β₁ lngdp_it + β₂ lngdp_jt + β₃ lndist_jt + β₄ fta_jt + κ_j + ξ_t + (ζ_jt + ε_jt)

As the distance (lndist) is time-invariant, and using “trade shares” (trade flows scaled by the product of GDPs) on the LHS of the equation:
(4) ln(exp_jt / (gdp_it × gdp_jt)) = β₀ + β₄ fta_jt + κ_j + ξ_t + (ζ_jt + ε_jt)

In Stata:
xtset code year
xtreg lnexp lngdp lngdp_us fta i.year, fe vce(cluster code)
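
For comparison, the trade-share specification in equation (4) could be sketched as follows. This is an illustration on simulated data, not the poster's data: the variable names exports, gdp, gdp_us, and lnshare, the staggered FTA start dates, and the true effect of 0.3 are all invented.

```stata
* Simulated panel: 10 partner countries, 5 years each (all values made up)
clear
set seed 12345
set obs 10
gen code = _n
expand 5
bysort code: gen year = 2000 + _n

gen double gdp_us = 100 + 5*(year - 2000)
gen double gdp    = exp(rnormal(3, 0.5))

* Staggered treatment: countries 1-4 sign FTAs in different years
gen byte fta = code <= 4 & year >= 2002 + mod(code, 3)

gen double exports = exp(0.3*fta + rnormal(0, 0.2)) * gdp_us * gdp / 1000

* Left-hand side of equation (4): log of exports over the product of GDPs
gen double lnshare = ln(exports / (gdp_us * gdp))

xtset code year
xtreg lnshare fta i.year, fe vce(cluster code)
```

Scaling by the product of GDPs imposes unit coefficients on both log GDP terms, which is why they no longer appear on the right-hand side.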

This would mean that I only use time effects and country effects in model (4) (because there is only one exporting country, i). Will I then be controlling for the “multilateral resistance terms” in a manner similar to that pointed out by Anderson and van Wincoop, and by Baier & Bergstrand (2007) (BB), with respect to equation (1)?

By the way, in this paper**, BB estimate Eq. (1) using bilateral-pair (ij) fixed effects along with country-and-time (it, jt) effects, where the country-and-time (it, jt) effects account explicitly for the time-varying “multilateral price terms”. On page 89, they say: “The average treatment effect of an FTA (in our model β₄) of 0.46 implies again that an FTA increases trade by a cumulative amount of about 58%.” (exp(0.46) − 1 ≈ 0.58).
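
The percentage effect implied by a coefficient b on a dummy variable in a log-linear model is 100 × (exp(b) − 1), which can be checked directly for the quoted coefficient of 0.46:

```stata
* Percentage change in trade implied by a coefficient of 0.46
display 100 * (exp(0.46) - 1)
```

This displays a value of roughly 58, in line with the figure quoted from the paper.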

Does this cumulative value take into account that the treated countries did not all start their FTAs at the same time?

I have other concerns, but I believe that is enough for today. I would appreciate any suggestions. Thanks in advance.

-----------------------------
** Baier, S. L. & Bergstrand, J. H. (2007). Do free trade agreements actually increase members' international trade? Journal of International Economics 71, 72–95.

↧

IV Regression with multiple endogenous regressors

March 1, 2017, 3:45 am

I'm currently a fourth-year student writing my dissertation. My Stata knowledge is fairly basic, but I'm trying to run a 2SLS regression with two endogenous variables (call them A and B), each of which has two instruments. I need to run the regression with both variables treated as endogenous simultaneously. I've tried using the standard ivreg with two sets of brackets: ivreg Y (A = x1 x2) (B = x3 x4), but this hasn't worked.
Can anyone suggest the best way to approach this? I know this is potentially a fairly basic question, but it's the crucial section of my paper, so I want to make sure I get it right. I've trawled the internet for hours, but to no avail.
Thanks
Iain
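
A minimal sketch of the usual syntax: in 2SLS all instruments serve all endogenous regressors jointly, so both endogenous variables and all four instruments go inside a single set of parentheses, along the lines of ivregress 2sls Y (A B = x1 x2 x3 x4). The example below runs on Stata's shipped auto dataset; the variables are placeholders chosen only so the command executes, not a meaningful model.

```stata
* Shipped example data; variable roles are placeholders only
sysuse auto, clear

* Two endogenous regressors (mpg, weight), four excluded instruments
ivregress 2sls price (mpg weight = length turn displacement gear_ratio), first

* First-stage diagnostics and the overidentification test
estat firststage
estat overid
```

With four instruments and two endogenous regressors the model is overidentified, which is what makes estat overid available.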

↧

Unusual p-values with -reghdfe-

March 1, 2017, 4:32 am

Hi everyone,

I am new to using reghdfe. I am running a regression in which I absorb two variables as fixed effects. I am also clustering the errors on those two variables. I get large t-statistics, such as 4.73, for which the p-value is only 0.018. Why are the p-values not what we would get with a regular regression? I tried dropping the clustering and using just robust, and that brought the p-values and t-statistics in line with what you'd normally see.

Thanks for your help in understanding this issue.
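
A possible explanation, stated as an assumption about how the degrees of freedom are set: with clustered errors, the t-test is commonly based on the number of clusters minus one (the smallest dimension under multi-way clustering) rather than on the number of observations. The reported numbers are consistent with about 3 degrees of freedom, which can be checked with ttail():

```stata
* Two-sided p-value for t = 4.73 with 3 degrees of freedom
display 2 * ttail(3, 4.73)

* The same t-statistic with many degrees of freedom, for comparison
display 2 * ttail(1000, 4.73)
```

The first value should come out close to the reported 0.018, which would suggest that one of the clustering variables has only about four distinct values; e(df_r) after estimation would confirm the degrees of freedom actually used.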

↧

Proportion test tertiles

March 1, 2017, 5:35 am

Hello,

I am having a hard time figuring out whether it is possible to first divide a variable into three parts (tertiles) using the xtile command and then test the top tertile against the bottom one. I always receive an error because, even after creating the tertile split, my grouping variable still takes three values, whereas the proportion test only allows two groups to be compared, for instance proportion A against proportion B. So far I have not been able to come up with a workaround by creating a new variable. Any help on this would be highly appreciated.

Best, Max
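
One workaround, sketched on Stata's shipped auto dataset: leave the tertile variable as it is and restrict the test to the bottom and top groups with an if qualifier, so that by() sees only two values. The choice of price and foreign is arbitrary.

```stata
* Shipped example data
sysuse auto, clear

* Split price into tertiles (1 = bottom, 3 = top)
xtile tert = price, nq(3)
tabulate tert foreign

* Compare the proportion of foreign cars in the bottom and top tertiles only
prtest foreign if inlist(tert, 1, 3), by(tert)
```

No new variable is needed, because the if qualifier removes the middle tertile before prtest counts the groups.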

↧

stata ml estimation

March 1, 2017, 5:55 am

Hello everyone,

I am trying to create a clogit equivalent. Below is my program.

Note that:
1. x1 x2 x3 id choice are hard-coded variables. They have the same names as the ones I loaded into the dataset.
2. I am trying to estimate beta1 to beta3, which have nonlinear effects.

program myconditional_logit
args todo beta1 beta2 beta3 beta4 lnL
version 11
tempvar den p last xb

gen double `xb' = `beta1' * x1^`beta1' + `beta2' * x2^`beta2' +`beta3' * x3 ^`beta3'
local y choice
local by1 id
sort `by1'
quietly{
by `by1': egen double `den' = sum(exp(`xb'))
gen double `p' = exp(`xb')/`den'
mlsum `lnL' = `y' *log(`p') if `y'==1
if (`todo'==0 | `lnL' > =.) exit
}
end

My function call is
ml model d0 myconditional_logit () () () ()

However, when I try to run the program, it issues the following error:

myconditional_logit 0 __000009 __00000A __00000B __00000C
- `begin'
= capture noisily version 13: myconditional_logit 0 __000009 __00000A __0
> 0000B __00000C
----------------------------------------- begin myconditional_logit ---
- args todo beta1 beta2 beta3 beta4 lnL
- version 11
- tempvar den p last xb
- gen double `xb' = `beta1' * x1 + `beta2' * x2 +`beta3' * x3 +`beta4'
> * x4
= gen double __00000G = __000009 * x1 + __00000A * x2 +__00000B * x3 +_
> _00000C * x4
matrix operators that return matrices not allowed in this context
------------------------------------------- end myconditional_logit ---
- `end'

I think this is because Stata treats x1 as a full vector. Is there any way I can make x1 observation-specific?

Thanks everyone!
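
A likely cause, offered tentatively: a d0 evaluator receives three arguments (todo, the full coefficient row vector, and the name of the log-likelihood scalar), so `beta1' in the program above holds a matrix name rather than a single coefficient. The sketch below pulls each parameter out with mleval. It assumes the poster's hard-coded variables x1 x2 x3 id choice exist and that x1 to x3 are positive; the program name, the starting values, and the use of constant-only equations are choices made here for illustration, and the sketch has not been run on the poster's data.

```stata
capture program drop myclogit_d0
program define myclogit_d0
    version 11
    args todo b lnf                 // d0 passes the whole coefficient vector as b
    tempname b1 b2 b3
    tempvar xb den
    * Each equation is constant-only, so extract its value as a scalar
    mleval `b1' = `b', eq(1) scalar
    mleval `b2' = `b', eq(2) scalar
    mleval `b3' = `b', eq(3) scalar
    quietly {
        gen double `xb' = `b1'*x1^`b1' + `b2'*x2^`b2' + `b3'*x3^`b3'
        * Denominator: sum of exp(xb) over the alternatives within each id
        by id: egen double `den' = total(exp(`xb')) if $ML_samp
        * One log-likelihood contribution per chosen alternative
        mlsum `lnf' = `xb' - ln(`den') if choice == 1
    }
end

sort id                              // the evaluator relies on this sort order
ml model d0 myclogit_d0 /beta1 /beta2 /beta3
ml init 1 1 1, copy                  // arbitrary starting values
ml maximize
```

Writing the model as three constant-only equations (/beta1 /beta2 /beta3) is what lets mleval return scalars; whether this nonlinear specification is identified in a given dataset is a separate question.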

↧

xmlsave consumers

March 1, 2017, 6:21 am

Are there any tools/packages/applications that readily consume XML files produced with Stata's xmlsave? (other than Excel for doctype(excel))

Thank you, Sergiy

↧

Graph combine

March 1, 2017, 7:23 am

Hi everyone,

I have two separate graphs which I want to plot together in one graph.
Graph 1 is

Code:

graph twoway line valuecost demand, name(g1,replace) sort || line valuecost supply, ytitle( "Price" ) xtitle( "Quantity" ) yline(60, lpattern(dash)) legend(label(1 "Demand") label(2 "Supply"))

Graph 2 is

Code:

graph twoway scatter price tradenumber,name(g2,replace) by(period,  row(1) compact) xlabel(minmax) yline(60, lpattern(dash)) ytitle( "" ) xtitle( "Trade number" ) yscale(off)

My first question is: although I specified in my command that I want the y axis turned off, I still get a graph with a y axis. Is there something I am missing?

Next I try to combine them as follows:

Code:

graph combine g1 g2,name(g3, replace) imargin(0 0 0 0)

My goal is to combine them without any space between them. That is the reason I want to suppress the yaxis in graph 2 above. Any ideas on how I could achieve this?

Thank you

P.S. I am using Stata 11.1.
My dataset looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(tradenumber price period valuecost demand supply equilibriumprice equilibriumquantity)
 1 100 1 100  0 30 60 20
 2  25 1 100  8 30 60 20
 3 100 1  90  8 30 60 20
 4  99 1  90  8 30 60 20
 5 100 1  80  8 30 60 20
 6 100 1  80 16 30 60 20
 7  78 1  70 16 30 60 20
 8  80 1  70 16 20 60 20
 9  85 1  60 16 20 60 20
10 100 1  60 24 20 60 20
11  80 1  50 24 20 60 20
12 100 1  50 24 20 60 20
13  60 1  40 24 20 60 20
14  47 1  40 24 20 60 20
15  50 1  30 24 20 60 20
16  55 1  30 24 10 60 20
17  35 1  20 24 10 60 20
18  75 1  20 24 10 60 20
19  40 1  20 24 10 60 20
20  60 1  10 24 10 60 20
end

↧

Rainfall data

March 1, 2017, 8:17 am

Hi all,

I am working with monthly rainfall data for the years 1993-2010, for seven districts. Each district has a certain number of grid cells with information on longitude and latitude, monthly rainfall, and area in sq km. My problem arises when creating the district variable: whenever two districts share the same values for longitude and latitude, only the information for one of the two districts is kept with my current coding.

Ex.
gen district = .
/*district 1 */
replace district =1 if a==77.25 & b==17.25
replace district =1 if a==77.25 & b==17.50
replace district =1 if a==77.25 & b==17.75

/*district 2*/
replace district=2 if a==77.25 & b==17.25
replace district=2 if a==77.25 & b==17.50
replace district=2 if a==77.50& b==17.25

Evidently, for district 1 only one case will be kept, because the first two cases will be overwritten by the "replace" commands for district 2. How can I keep all the information for each district despite the overlap?

Thanks !
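
One way to keep shared grid cells in both districts, sketched under assumptions: store the district-to-cell assignments in a separate lookup file with one row per (district, cell) pair, then use joinby so that a cell shared by two districts appears once for each. The lookup below copies the six assignments from the example above; the rainfall values and the names year, month, and rain are made up.

```stata
* Lookup: one row per district-cell pair (from the example above)
clear
input byte district double(a b)
1 77.25 17.25
1 77.25 17.50
1 77.25 17.75
2 77.25 17.25
2 77.25 17.50
2 77.50 17.25
end
tempfile lookup
save `lookup'

* Rainfall by grid cell (made-up values, one month only)
clear
input double(a b) int year byte month double rain
77.25 17.25 1993 1 12.5
77.25 17.50 1993 1 10.1
77.25 17.75 1993 1  8.7
77.50 17.25 1993 1  9.9
end

* Form all pairings of rainfall rows and district assignments by cell
joinby a b using `lookup'
sort district a b
list, sepby(district)
```

The two shared cells now appear twice, once per district, so district-level summaries can use bysort district without losing information.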

↧

time series probit regression analysis

March 1, 2017, 9:19 am

Hi,

I am attempting to use Stata to run a probit regression on time series data using the xtprobit command; however, I am not getting the expected results.

I am trying to measure what causes football clubs to become insolvent. I am regressing insolvency events on two sets of residuals from fixed-effects regressions: one represents shocks to a club's league position caused by factors other than wages, and the other represents shocks to a club's revenue caused by factors other than league position.
I am also using the division the club was in at the date of the insolvency event as a dummy variable.
The results are not as expected. In particular, presence in the lower tiers of English football is shown as insignificant (with the highest tier as the base group). I don't understand this, because almost all insolvency events took place in the lower tiers, so I would expect significant results.

Therefore, I wanted to check that the data in my Stata data set are laid out in the right way. I have a database of the clubs that participated in the football league in every season from 1996/97 to 2014/15; there is a separate entry for every season in which a club competed in the football league (the top 4 divisions). It also includes which league they were in that season. I also have a column named insolvency event, which contains a 1 if the club had an insolvency event in that season and a 0 if not.

Any help on this matter would be much appreciated.

Thank you

↧

Replicating the values of a variable to fill specific gaps

March 1, 2017, 9:46 am

Dear Statalisters,

I am working with a dataset originating from two different datasets, one with a series of variables regarding the household quality of a number of families, the other certain aspects of the quality of life of its members.

I need to work with both, but, logically, the number of observations is considerably greater in the "individual" dataset.

To fill the gaps in the "household" dataset, I would like to replicate the values of the "household" dataset's variables so that it features the same value each time the same family code is present in the "individual" dataset.

To further make myself clear: if person #1, #2 and #3 are all members of family #1, in which I have a corresponding value of 1000 for variable X, I would like for each member of family #1 to feature the value 1000 in variable X.

I think that some use of the "by/bysort" prefix is what I need, but I am not sure.

I checked whether this question had been asked before, and I don't think it has. If it has, please redirect me to the thread.

Thanks in advance for any contribution!

Tommaso Bechini
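
A minimal sketch of one standard approach, assuming the household and individual data are still available as separate files: a many-to-one merge on the family code copies each household value to every member. The data and the names famcode, person, and X are made up to mirror the example in the post.

```stata
* Household file: one row per family (made-up data)
clear
input famcode X
1 1000
2 2500
end
tempfile hh
save `hh'

* Individual file: several persons per family (made-up data)
clear
input person famcode
1 1
2 1
3 1
4 2
end

* Each person receives the X value of his or her family
merge m:1 famcode using `hh'
list, sepby(famcode)

* If the household rows were instead appended to the individual rows in a
* single file, X could be spread within each family like this:
* bysort famcode (X): replace X = X[1] if missing(X)
```

The _merge variable created by merge shows whether any individuals lack a matching household record.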

↧

Test

March 1, 2017, 10:06 am

This is a test before submitting a post.

Thank you.

↧