
Missing data in variable

Hello,

I have a panel dataset with roughly 5,000 observations. Three of my eight independent variables have missing values, about 100 to 200 observations each, sparsely distributed across the dataset (many non-missing values before and after each gap). I'm reading up on the preferable way to handle these. Linear interpolation seems an option, but I'm having difficulty seeing which other variables the independent variables are a function of; multiple imputation seems like overkill for this small amount of missing data; and mean replacement is commonly advised against. Do you have any suggestions for tackling this problem?

To be clear, this is an economic dataset in which stock volatilities are studied using independent variables in the conditional variance equation of a GARCH model.

Thank you for your time and any help you can offer.
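For reference, the two candidate approaches map onto standard Stata commands. A minimal sketch, assuming the panel identifier is id, the time variable is t, and x1 is one of the affected regressors, with x2 and x3 as other covariates (all hypothetical names):

Code:
* linear interpolation within each panel (hypothetical names)
xtset id t
bysort id (t): ipolate x1 t, gen(x1_ipol)

* or multiple imputation, should the gaps turn out not to be ignorable
mi set wide
mi register imputed x1
mi impute regress x1 x2 x3, add(20) rseed(12345)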

selective generation of graphs using pvar2

Code:
pvar2 roeendw Lmiss lconw lcoffw DLLw eqcgtaw Loss2w gdpg hh inflation st lt, lag(2) gmm monte 500 12 2 decomp 30 5 getresid
This gives me the impact of each variable on the others.
1. I am wondering whether I can graph only the impact of Lmiss on the other variables. If so, how can I adjust my code to achieve this?
2. Also, the graphs show confidence bands based on a significance level of 5%. How can I change the significance level to 10%?
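If switching to the newer pvar package (Abrigo and Love, available from SSC) is an option, its pvarirf post-estimation command can restrict which impulse-response pairs are plotted; whether the confidence level can be moved off 5% depends on the package version, so check help pvarirf. A hedged sketch, not tested with this specification:

Code:
* hedged sketch using -pvar-/-pvarirf- (ssc install pvar), not pvar2
pvar roeendw Lmiss lconw lcoffw DLLw eqcgtaw Loss2w gdpg hh inflation st lt, lags(2)
pvarirf, impulse(Lmiss) mc(500) oirf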


Create a dummy benchmarked to group mean (panel data) and reset its value if an adjacent observation is a certain value

Hi everyone,

I have panel data with over 10,000 observations. Below are a few data points to help explain my question.

The data show industry (sic4), year, and the tariff rate change from the previous year; a negative value of xvar indicates the rate decreased by that much from the prior year.

I want to flag the years that had large negative changes, benchmarked to the industry mean. By large, I mean the change is two times the mean or more: e.g., if industry 2011's mean over all years is -0.01, then in 1985, where xvar = -0.056, the change is more than five times the mean, so Large_dummy = 1.

step #1 i can do

Code:
bysort sic4: egen m_xvar = mean(abs(xvar))
gen Large_dummy = 1 if xvar != . & xvar < 0 & abs(xvar) > 2*m_xvar

I have trouble with steps #2 and #3.

Step #2: I want to reset Large_dummy to zero if a large negative change in year t was followed by a large positive change in year t+1, or a large positive change in year t was preceded by a large negative change in year t-1. The idea is not to count transitory changes in the tariff rate for an industry, because the adjacent changes negate each other.

Step #3: If I have multiple Large_dummy = 1 observations in an industry, how do I flag the observation with the largest change? I guess this will be easier once step 2 is done.

Thank you so much ,

Rochelle

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(sic4 year) double xvar
2011 1975 .37097025 
2011 1976 -.59929764 
2011 1977 .24782744 
2011 1978 -.75217545 
2011 1979 -.49662915 
2011 1980 -.59931552 
2011 1981 -.25636095 
2011 1982 .0489553 
2011 1983 .05009897 
2011 1984 .02054092 
2011 1985 -.05650408 
2011 1986 -.12782624 
2011 1987 -.14867632 
2011 1988 -.0004699 
2011 1989 . 
2011 1990 -.00403237 
2011 1991 -.1436553 
2011 1992 -.06235075 
2011 1993 -.16818738 
2011 1994 -.0305934 
2011 1995 .15119302 
2011 1996 -.29577011 
2011 1997 -.08884877 
2011 1998 .20330644 
2011 1999 -.03562927 
2011 2000 .14391983 
2011 2001 -.18157023 
2011 2002 -.18734139 
2011 2003 .0017103 
2011 2004 .78501451 
2011 2005 .11708868 
2013 1974 . 
2013 1975 .37096992 
2013 1976 -.59929812 
2013 1977 .24782838 
2013 1978 -.75217593 
2013 1979 -.49662915 
2013 1980 -.59931582 
2013 1981 -.25636101 
2013 1982 .04895521 
2013 1983 .05009923 
2013 1984 .02054102 
2013 1985 -.0565042 
2013 1986 -.12782653 
2013 1987 -.14867611 
2013 1988 -.00046983 
2013 1989 . 
2013 1990 -.06166172 
2013 1991 -.00946093 
2013 1992 -.04130936 
2013 1993 -.10242367 
2013 1994 -.29202628 
end
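A hedged sketch of steps #2 and #3, assuming the data are xtset by sic4 and year so the lead operator F. works (untested against the full dataset):

Code:
xtset sic4 year
bysort sic4: egen m_xvar = mean(abs(xvar))
gen large_neg = xvar < 0 & abs(xvar) > 2*m_xvar & !missing(xvar)
gen large_pos = xvar > 0 & abs(xvar) > 2*m_xvar & !missing(xvar)
gen Large_dummy = large_neg

* step 2: zero out transitory drops, i.e. a large drop undone the next year
replace Large_dummy = 0 if large_neg & F.large_pos == 1

* step 3: among the remaining flags, mark the largest (most negative) change
egen min_xvar = min(cond(Large_dummy == 1, xvar, .)), by(sic4)
gen biggest = Large_dummy == 1 & xvar == min_xvar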

Zscores or Theta EAP

I have two questions.
I am working with data containing student scores for English and Math. These scores are the dependent variables in my model. Should I convert the scaled scores into z-scores or theta EAP scores? What would be the best technique if I want to measure the effect of an independent variable on test scores?
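For the z-score route, Stata's egen std() function does the standardization. A minimal sketch with hypothetical variable names:

Code:
* standardize each scaled score to mean 0, sd 1 (hypothetical names)
egen z_math    = std(math_score)
egen z_english = std(english_score)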

How to make bigger graph

Hello,

I've been trying to visualise the different effects of the IV on the DP for certain subgroups in the dataset. I've used this code:

Code:
graph twoway scatter DP IV, msymbol(Oh) || lfit DP IV, lwidth(medthick) || , xlabel(-1(1)4, grid) ytitle("graph") by(variable name of the subgroup, legend(off) note(""))
The problem is that the output contains 130+ "mini" graphs, because of the multilevel structure of the data.

I want to ask whether you know how to make each graph bigger (so I can actually present the results), or whether the command can be changed to produce 100+ single graphs (one for each cluster).

Thank you for help.


P.S.: I am including the output graph:

[attached graph omitted]
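One route to standalone, full-size graphs is to loop over the clusters with levelsof. A hedged sketch, assuming the grouping variable is named cluster and takes integer values (hypothetical name):

Code:
levelsof cluster, local(groups)
foreach g of local groups {
    graph twoway (scatter DP IV, msymbol(Oh)) (lfit DP IV, lwidth(medthick)) ///
        if cluster == `g', xlabel(-1(1)4, grid) ytitle("graph") ///
        name(g`g', replace) title("Cluster `g'")
}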

Event window and estimation window ERROR

Dear Statalist,

I'd really appreciate any help members have to offer with my problem.

I'm running an event study following the Princeton method (http://dss.princeton.edu/online_help...ventstudy.html). I've successfully merged my market data, with 316 events over 8 years.
My event window is (-5, -2) and my estimation window is (-205, -6).

I used the following code:

Code:
sort company_id date
by company_id: gen datenum = _n
by company_id: gen target = datenum if date == event_date
egen td = min(target), by(company_id)
drop target
gen dif = datenum - td

by company_id: gen event_window = 1 if dif >= -5 & dif <= -2
egen count_event_obs = count(event_window), by(company_id)
by company_id: gen estimation_window = 1 if dif < -6 & dif >= -205
egen count_est_obs = count(estimation_window), by(company_id)
replace event_window = 0 if event_window == .
replace estimation_window = 0 if estimation_window == .
tab company_id if count_event_obs < 4
tab company_id if count_est_obs < 200
drop if count_event_obs < 4
drop if count_est_obs < 200

However, both my event and estimation windows turned out to be all 0. Therefore, when I used the drop commands, no observations were left.
Is there a problem with my code?
Thank you in advance.

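Two hedged things to check first. If event_date never coincides with a trading date in date (for example, the event fell on a non-trading day), then target is never set, td is missing, and every dif comparison fails. Note also that dif < -6 excludes day -6 itself, so if the estimation window is meant to include -6, the condition would need to be dif <= -6. A quick diagnostic:

Code:
* does any trading date match the event date within each company?
egen anymatch = max(date == event_date), by(company_id)
tab anymatch
count if missing(td)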

Adding a custom column to esttab

Hello, could someone help me figure out how to achieve the following output using estout/esttab?
[attachment: desired table layout omitted]

Basically, I would like the first column to include "regular" results from a regression model, and I'd like the second column to contain "custom" values that I define myself.

I tried a few different approaches. The one that came closest was to use ereturn repost and replace the values in e(b) with my custom values (see below). The trouble is, I don't know how to show the std. errors and stars for the first column but not for the second.

Any help? Thanks in advance!
------

Code so far:

Code:
sysuse auto, clear   // example dataset, added so the snippet runs on its own

est clear
eststo: reg price weight mpg

* replace the coefficients of the stored model with custom values
program define alterations, eclass
    estimates restore est1
    matrix m = e(b)
    matrix m[1,1] = 30
    matrix m[1,2] = 26
    matrix m[1,3] = 35
    ereturn repost b = m
end

* change coeffs
alterations
eststo est2

eststo est3: reg price weight mpg

esttab est3 est2





Insignificant interaction term

Dear Statalist,

I have a question that is not so much about Stata commands but rather about statistics in general.

I am comparing the treatment effect of an intervention on women's empowerment in Uganda and Tanzania. The intervention is exactly the same in both countries. To do so, I run a regression model that includes a country dummy (1 for Tanzania, 0 for Uganda) and an interaction between country and treatment, to capture heterogeneity in the treatment effect.
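In Stata's factor-variable notation, that specification looks like the following minimal sketch (hypothetical variable names):

Code:
* empower = outcome, treat = intervention, tz = 1 Tanzania / 0 Uganda
reg empower i.treat##i.tz, vce(robust)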

The output seems strange to me. Here is the treatment effect in Tanzania (when I run separate regressions for each country): [regression output omitted]

And below is the treatment effect in Uganda: [regression output omitted]

From the output I can say that the intervention has no impact on the share of time spent on reproductive and productive work in Uganda, but there is a significant impact in Tanzania. When I include the interaction term to examine the heterogeneity of the treatment effect, this is what I get:
[regression output omitted]

My question is:

The p-value of the interaction term for productive work is not significant. What can I conclude from this? Does this mean there is no heterogeneity in the treatment effect between the two countries? Looking at the separate models, shouldn't the impact be stronger in Tanzania than in Uganda?

Thank you very much !

Lan.

Converting string to time variable

Hey, I have monthly data in the following format:

1970M1
1970M2
1970M3
...
2017M12


How can I convert this into a Stata time variable?

Many thanks in advance and any help will be appreciated.
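The usual route is the monthly() function with a "YM" mask, followed by a %tm display format. A minimal sketch, assuming the string variable is named datestr (hypothetical name):

Code:
gen mdate = monthly(datestr, "YM")
format mdate %tm
tsset mdate   // if this is a single time series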

Predicting xb with (temporary) alterations to x

I realize there must be an easy option that I'm missing here... I want to predict "yhat" after a regression, with various alterations to the matrix of covariates. Of course I can do this by changing the X variables directly, but if I want to do this over and over, predicting the outcomes under various X scenarios, it becomes a pain to switch X back and forth from its original values. I thought either predict or margins would allow me to do this, but predict doesn't allow an at() option, and margins doesn't seem to have an option for actually predicting a variable (i.e., a value for each observation). Is there an efficient way to do this that I'm missing? Thanks!
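One common workaround is to clone the covariate, overwrite it, predict, and then restore the original values. A hedged sketch with hypothetical names:

Code:
reg y x1 x2
clonevar x1_orig = x1

replace x1 = x1 + 1           // scenario: everyone gets one extra unit of x1
predict double yhat_scn1, xb

replace x1 = x1_orig          // put the original values back
drop x1_orig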

System GMM xtabond2

Dear all,

I'm currently investigating the relationship between carbon emissions and socioeconomic variables for Brazilian states. My data cover 27 states and 6 time points (5-year intervals, 1990-2015). According to the literature, my regressors are likely to be endogenous, so I decided to run the system GMM estimator using xtabond2. However, as a new Stata user, I'm not convinced I'm doing it correctly, given the unusual output below.

My variables are: I = carbon emissions, P = population, Arpc = per capita GDP and T = technology. I've also included year fixed effects.

Code:
xtabond2 L(0/1).I P Arpc T i.AnoStata, gmmstyle(L.(I P Arpc T), laglimits(1 2) collapse equation(diff)) gmmstyle(L.(I P Arpc T), laglimits(0 0) collapse eq(level)) ivstyle(i.AnoStata, eq(level)) twostep robust

Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.
Warning: Two-step estimated covariance matrix of moments is singular.
  Using a generalized inverse to calculate optimal weighting matrix for two-step estimation.
  Difference-in-Sargan/Hansen statistics may be negative.

Dynamic panel-data estimation, two-step system GMM

Group variable: Ufs                             Number of obs      =       135
Time variable : AnoStata                        Number of groups   =        27
Number of instruments = 17                      Obs per group: min =         5
Wald chi2(10) =  2.00e+06                                      avg =      5.00
Prob > chi2   =     0.000                                      max =         5

------------------------------------------------------------------------------
             |              Corrected
           I |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           I |
         L1. |   .7259731    .206855     3.51   0.000     .3205448    1.131401
             |
           P |   .2356326   .2079675     1.13   0.257    -.1719761    .6432413
        Arpc |     .08098   .2533465     0.32   0.749    -.4155699      .57753
           T |    .051556   .2607521     0.20   0.843    -.4595087    .5626208
             |
         Ano |
        1990 |          0  (empty)
        1995 |    1.93051   1.654896     1.17   0.243    -1.313027    5.174047
        2000 |   2.019052   1.712828     1.18   0.238    -1.338029    5.376133
        2005 |    1.83546   1.831226     1.00   0.316    -1.753678    5.424598
        2010 |   1.981132   1.858962     1.07   0.287    -1.662367    5.624631
        2015 |   1.997801   1.917377     1.04   0.297    -1.760188     5.75579
             |
       _cons |          0  (omitted)
------------------------------------------------------------------------------

Instruments for first differences equation
GMM-type (missing=0, separate instruments for each period unless collapsed)
L(1/2).(L.I L.P L.Arpc L.T) collapsed
Instruments for levels equation
Standard
1990b.Ano 1995.Ano 2000.Ano 2005.Ano 2010.Ano 2015.Ano
_cons
GMM-type (missing=0, separate instruments for each period unless collapsed)
D.(L.I L.P L.Arpc L.T) collapsed

Arellano-Bond test for AR(1) in first differences: z =  -2.23  Pr > z =  0.026
Arellano-Bond test for AR(2) in first differences: z =   1.97  Pr > z =  0.049

Sargan test of overid. restrictions: chi2(6)    =  23.71  Prob > chi2 =  0.001
(Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(6)    =   7.12  Prob > chi2 =  0.310
(Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
GMM instruments for levels
Hansen test excluding group:     chi2(2)    =   4.11  Prob > chi2 =  0.128
Difference (null H = exogenous): chi2(4)    =   3.01  Prob > chi2 =  0.557
gmm(L.I L.P L.Arpc L.T, collapse eq(level) lag(0 0))
Hansen test excluding group:     chi2(2)    =   4.11  Prob > chi2 =  0.128
Difference (null H = exogenous): chi2(4)    =   3.01  Prob > chi2 =  0.557
iv(1990b.Ano 1995.Ano 2000.Ano 2005.Ano 2010.Ano 2015.Ano, eq(level))
Hansen test excluding group:     chi2(2)    =   5.94  Prob > chi2 =  0.051
Difference (null H = exogenous): chi2(4)    =   1.18  Prob > chi2 =  0.882

According to the output above, we cannot reject the null hypothesis that the overidentifying restrictions are valid (Hansen test). The Difference-in-Hansen values are the p-values for the validity of the additional moment restrictions required by system GMM; again, we do not reject the null that those additional moment conditions are valid. On the other hand, there is evidence of second-order autocorrelation (the AR(2) test has p = 0.049), and my lagged dependent variable is the only regressor that is statistically significant.


All comments and suggestions are very welcome and valuable.


Best regards,

IF commands based on p-values and nature of variable (e.g., continuous or categorical)

I have two related questions regarding the use of if commands in Stata.

(1) IF Commands based on p-values

Let's say I have a simple regression that includes an interaction between an IV (x1) and a moderator variable (x2):

Code:
reg y c.x1##c.x2
If the interaction term (c.x1#c.x2) is significant, it warrants probing the simple slopes using the code below.

Code:
qui sum x2 if e(sample) == 1
local atmodlo = r(mean) - r(sd)
local atmodhi = r(mean) + r(sd)
margins, dydx(x1) at(x2 = (`atmodlo' `atmodhi')) vsquish
margins, dydx(x1) at(x2 = (`atmodlo' `atmodhi')) vsquish pwcompare(effects)

However, as I am running these commands within loops in Stata, I want Stata to compute the simple slopes ONLY if the interaction term in the regression above is significant, i.e. only if the p-value for c.x1#c.x2 in the regression output is less than 0.05, so as to avoid excessive and unnecessary computation.

Ideally, my resulting command would look something like this:
if p-value from the regression above < .05 {
All the simple slopes code I want to run
}

Normally, one would do this by calling the desired scalar from the regression (in this case, the p-value) using ereturn list. Unfortunately, however, Stata does not appear to store the p-value as a scalar.

Another option might be to grab the p-value from the matrix for the equation. In this case, after running the initial regression I would type:

Code:
matrix table = r(table)
matrix list table
This shows that the p-value for the interaction term is in the 3rd column, 4th row of the table. So, ideally, I would say something like:
if the value contained in location [3,4] of the matrix table < .05 {
All the simple slopes code I want to run
}
Here is where I am stuck, as I don't know how to call that specific value from within the matrix.
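Individual matrix cells can be read into a scalar with standard [row, column] subscripting; in r(table) the p-values sit in row 4, and the interaction term is the 3rd column here. A minimal sketch:

Code:
reg y c.x1##c.x2
matrix table = r(table)
scalar p_int = table[4, 3]          // or by name: table["pvalue", "c.x1#c.x2"]
if p_int < .05 {
    * all the simple-slopes code to run
}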


(2) if commands based on the nature of a variable (continuous or categorical)

Relatedly, as the code for computing simple slopes varies slightly depending on whether the moderator variable is continuous or categorical, I want to tell Stata:
if variable x2 is continuous {
The specific simple slopes code I want to run
}

else if variable x2 is categorical {
The specific simple slopes code I want to run
}
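Stata stores no intrinsic continuous-versus-categorical flag, so any test is a heuristic. One hedged sketch treats a variable as categorical when it takes few distinct values (the cutoff of 10 is arbitrary):

Code:
quietly levelsof x2, local(vals)
if `: word count `vals'' > 10 {
    * continuous-moderator simple-slopes code
}
else {
    * categorical-moderator simple-slopes code
}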

I'm sure that there are simple commands that I'm overlooking. Any help would be appreciated.

Probability of type 2 error

Hi all,

I ran a fixed-effects regression on my panel data and found the coefficients on my IVs insignificant. I want to determine the probability of a type 2 error. Can anyone tell me how to do that in Stata?

Thank you in advance.

Umme

Keeping observations by year

I am working with three variables: company, income, and year (2007 and 2017). How can I keep the first 20 observations, based on income in descending order, by year?

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double year str40 company float income
2017 "VN000000TCB8"   6445595
2017 "VN000000VPB6"   6294328
2017 "VN000000BID9"   5122230
2017 "VN000000MBB5"   3519627
2017 "ID1000113707"   3027466
2017 "ID1000092703"   2412458
2017 "ID1000094402"   2175824
2017 "COT29PA00025"   2169839
2017 "COB01PA00030"   2064130
2017 "KR7086790003"   2017741
2017 "ID1000099302"   1804031
2017 "COB51PA00076"   1275266
2017 "ID1000118508"   1220886
2017 "KR7024110009"   1219904
2017 "ID1000123904"   1159370
2017 "VN000000VIB1"   1124279
2017 "COT23PA00010"   1059992
2017 "VN000000TPB0"    938780
2017 "ID1000098205"    748433
2017 "LB0000010415"    747337
2017 "LB0000033441"    726701
2017 "JP3885780001"    603544
2017 "CLP1506A1070"    564815
2017 "JP3699200006"  506691.2
2017 "KR7138930003"  391170.2
2017 "COB23PA00067"    360712
2017 "JP3946750001"    312264
2017 "KR7139130009"    302208
2017 "ID1000128507" 263753.38
2017 "LB0000010530" 235525.05
2017 "COB02PA00012"    219998
2017 "LB0000010613"    200059
2017 "VN000000KLB8"    187711
2017 "KR7175330000" 185062.63
2017 "NGZENITHBNK9"    177614
2017 "JP3117700009"    158455
2017 "ID1000095508" 140495.53
2017 "ID1000103609"    135279
2017 "KR7192530004" 129014.34
2017 "ID1000100407"    121534
2017 "RU000A0JP5V6"    110400
2017 "UG0000000147"    106892
2017 "CLP102411004"    106006
2017 "CL0001692673"    103299
2017 "ID1000107402"     86140
2017 "GB0005405286"  85517.19
2017 "CLP8716M1101"  83134.05
2017 "NGUBA0000001"     76046
2017 "LB0000010332"  70045.25
2017 "ID1000128200" 69497.195
2007 "ZW0009011249"  53725208
2007 "ZW0009011967"  36635152
2007 "ID1000118201"   4838001
2007 "ID1000109507"   4489252
2007 "ID1000095003"   4346224
2007 "KR7105560007"   2757316
2007 "KR7055550008"   2475513
2007 "VN000000VCB4"   2397667
2007 "KR7053000006"   2201994
2007 "ID1000094204"   2116915
2007 "VN000000ACB8"   1732396
2007 "VN000000CTG7"   1149442
2007 "COB07PA00078"   1086923
2007 "KR7004940003"    960945
2007 "ID1000096605"    897928
2007 "ID1000098007"    770481
2007 "ID1000093701"    737905
2007 "JP3500610005"    645233
2007 "ID1000052400"    520719
2007 "VN000000EIB7"    463417
2007 "JP3890350006"    428400
2007 "ID1000093800"    420302
2007 "ID1000115702"    370667
2007 "VN000000HBB5"    365632
2007 "CLP0939W1081"  242287.7
2007 "HU0000061726"    208208
2007 "ID1000109101"  192751.5
2007 "KR7192520005"    160974
2007 "CLP321331116"  135375.8
2007 "VN000000SHB9" 126889.05
2007 "COB52PA00017"    121152
2007 "COB14PA00025"  116555.2
2007 "JP3892100003"    108315
2007 "RU0009029540"    106024
2007 "JP3405000005"    103820
2007 "JP3711200000"     79345
2007 "IS0000001469"     70020
2007 "CNE100000742"     69053
2007 "JP3305990008"     66289
2007 "JP3932800000"     64595
2007 "INE062A01020"   63643.8
2007 "KZ000A0KFFC1"     61354
2007 "CNE000001N05"     56248
2007 "KR7025610007"     54691
2007 "UG0000000386"  53017.36
2007 "KR7007800006"     52797
2007 "ID1000056302"  49554.09
2007 "KR7007200009"     45104
2007 "CNE100000RJ0"     43787
2007 "ID1000055205"  40744.45
end
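A minimal sketch of one standard approach: sort income in descending order within year, then keep the first 20 rows of each group:

Code:
gsort year -income
by year: keep if _n <= 20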

Typing a command while a do file is running

Hey,

I executed a very long do-file and somewhere in the middle found out that set more off didn't really work:

1. Does this mean that if I don't click --more-- myself, execution will not continue? Or is it just the output display that pauses?

and,

2. Is it OK to type set more off in the command window while my do-file is running? Will the do-file keep running?


Thanks ahead,
Ben



how do I restrict xtline height?

Hey, I am using the xline() option to generate these two red lines. I don't know how to make the lines stay within the confines of the plot region; 40, on the left, is the highest value on the y-axis. I have seen something like noextend, but I'm not sure how to incorporate it, or whether there are other solutions.

[attached graph omitted]
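noextend is a suboption of xline() itself, which should keep the added lines inside the plot region. A hedged sketch with hypothetical line positions:

Code:
* 10 and 20 stand in for the actual positions of the two red lines
xtline y, xline(10 20, lcolor(red) noextend)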

constant term highly significant

I am using a difference-in-differences model, and when I ran the regression the constant term was highly significant, with a t-value of 968.72. Is that normal, or am I doing something wrong? And what does it mean? Thank you very much for your help.

Question about formatting for logistic regression

Hi all,

I am working with a dataset of 37 variables and 380 observations. Individuals from three non-randomized groups were given 0, 1, or 2 interventions.

Each individual was surveyed twice, once after the intervention and again two years later. In the second survey the same individuals were contacted and identified with a unique caseid and a TimeSeries variable (0 for the first survey, 1 for the second). The second survey repeated only 4 questions from the first survey (the dependent measures) and collected demographic information, including gender, age, education, and occupation.

The data are currently in long format, as follows (REdithSS, REdithWealth, and Age are completely filled for all observations with TimeSeries == 1).

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long Village float nQ19 long(REdithSS REdithWealth) int Age float TimeSeries
2 1 . . . 0
1 3 . . . 0
1 2 . . . 0
3 3 . . . 0
1 1 . . . 0
2 1 . . . 0
3 1 . . . 0
1 2 . . . 0
2 2 . . . 0
2 2 . . . 0
1 2 . . . 0
2 3 . . . 0
3 4 . . . 0
1 1 . . . 0
1 4 . . . 0
end
label values Village village
label def village 1 "Bugembe", modify
label def village 2 "Kijinjomi", modify
label def village 3 "Kyakabuzi", modify
label values REdithSS LEdithSS
label values REdithWealth LEdithWealth
The team I am with wants to make a new dependent variable measuring the change in nQ19 between time 0 and time 1, and then run a multiple linear regression with Village (intervention received) and the demographics as predictors.

From what I understand this would require reshaping to wide format. However, when I try reshape this is what I get:

Code:
reshape wide nQ19 nQ20 nQ21 nQ22, i(caseid) j(TimeSeries)
(note: j = 0 1)
variable Age not constant within caseid
variable gender not constant within caseid
variable Rtribe not constant within caseid
variable Roccupation not constant within caseid
variable Reducation not constant within caseid
variable REdithWealth not constant within caseid
variable REdithSS not constant within caseid
    Your data are currently long.  You are performing a reshape wide.  You typed something like

        . reshape wide a b, i(caseid) j(TimeSeries)

    There are variables other than a, b, caseid, TimeSeries in your data.  They must be constant within caseid because that is the only way they can fit into wide
    data without loss of information.

    The variable or variables listed above are not constant within caseid.  Perhaps the values are in error.  Type reshape error for a list of the problem
    observations.

    Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop
    them.
r(9);
The Stata output seems to suggest that I would have to manually copy the responses from time 1 to time 0 within the same caseid to make the variables constant (the research team has already decided to treat demographics as constant over the two-year period).

Is there another way to conduct this regression? If not, is there code that would allow me to copy responses over from time 1 to time 0 within the same caseid?

Thank you for your help,
Christopher Tracey
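Copying the TimeSeries == 1 values onto the TimeSeries == 0 row of each caseid can be done without manual editing. A hedged sketch looping over the variables the reshape complained about:

Code:
foreach v of varlist Age gender Rtribe Roccupation Reducation REdithWealth REdithSS {
    bysort caseid (TimeSeries): replace `v' = `v'[_N] if missing(`v')
}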

generating new variable based on changes in panel data

Hi,

I've constructed a dataset with information on the political party in office for every town between 2002 and 2014. Elections were held in 2002, 2008, and 2014, so there are 3 election cycles: 2002 to end-2007, 2008 to end-2013, and 2014 to end-2014 (when my observation period ends). I want to create a variable that indicates whether there was a change in political party for each town between election cycles, and how many changes there were in total (0, 1, or 2) over the entire observation period.

The variables I have are:

polparty_4: categorical variable labeled as left, center, right, other
code_insee: unique identifier for each town
year: from 2002-2014

I've created the following variables based on the above:

Code:
generate party07 = polparty_4 if year==2007
generate party08 = polparty_4 if year==2008
generate party13 = polparty_4 if year==2013
generate party14 = polparty_4 if year==2014

My plan was to create a binary variable coded as 0 if party07=party08 (or party13=party14) for a particular commune, and 1 otherwise.

However, I'm struggling to generate the binary variable grouped by town (code_insee), as well as the count variable for the total number of changes in each town.

Thanks for your help!
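A hedged sketch of one way to finish this, spreading each comparison year's party to all of the town's rows with egen and cond() (this assumes polparty_4 is numerically coded):

Code:
egen p07 = max(cond(year == 2007, polparty_4, .)), by(code_insee)
egen p08 = max(cond(year == 2008, polparty_4, .)), by(code_insee)
egen p13 = max(cond(year == 2013, polparty_4, .)), by(code_insee)
egen p14 = max(cond(year == 2014, polparty_4, .)), by(code_insee)

gen change1  = p07 != p08 if !missing(p07, p08)   // change at the 2008 election
gen change2  = p13 != p14 if !missing(p13, p14)   // change at the 2014 election
gen nchanges = change1 + change2                  // 0, 1, or 2 per town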

Replacing egen variable with mean

Hi all,

Is it possible to replace a value with the mean of other observations, combined with an if condition?


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float deal_id double patent_tar int class_tar float(cites5yr_tar overlap)
1 7056957 521  7 .
1 6491950 424  2 1
1 6746691 424  0 1
1 6020305 424  0 1
1 6676967 424  0 1
1 6596308 424  1 1
1 7056494 424  2 1
1 6155251 128  6 1
1 6406715 424  1 1
1 6818229 424  0 1
1 5855884 424  1 1
1 6524615 424  0 1
1 7011848 424  0 1
2 6165985 514  0 1
2 6906059 514  1 1
2 7410962 514  0 1
2 7262184 514  0 1
2 7074325 210  0 1
2 7479378 435  0 1
2 6300336 514  4 1
2 7282325 435  0 1
2 6001833 514  2 1
2 7081451 514  3 1
2 6407106 514  2 1
2 7273860 514  0 1
2 6069147 514  0 1
2 5914330 514  1 1
2 6648212 228  1 .
2 6040303 514  0 1
2 6204271 514  0 1
2 6482820 514  3 1
2 7041303 424  0 1
2 7439235 514  0 1
2 7309706 514  0 1
2 7074822 514  0 1
2 7592344 514  0 1
2 6054461 514  2 1
2 7427611 514  0 1
2 6666171 119  1 .
2 6930207 568  0 1
2 6602880 514  2 1
2 7368582 549  0 1
2 7524833 514  0 1
2 6854427 119  0 .
2 6197198 210 21 1
2 5977127 514  0 1
2 6770649 514  0 1
2 6946243 435  2 1
2 6117879 514  2 1
2 7534903 549  0 1
2 7238470 435  0 1
2 7122357 435  1 1
2 7452875 514  0 1
2 6195941  49  1 .
2 7121069  54  0 .
2 7220733 514  0 1
2 5952327 514  0 1
2 7087620 514  0 1
2 6900322 546  0 1
2 7244743 514  1 1
2 7241770 514  0 1
2 6566369 514  1 1
3 7439253 514  0 1
3 7612087 514  0 1
3 7232834 514  7 1
3 7232833 514  4 1
4 6753158 435  0 1
4 6753151 435  1 1
4 6242175 435  1 1
5 7063943 435  3 1
5 6492160 435  2 1
5 6946546 530  1 1
5 6291650 530 10 1
5 6342588 530  4 1
5 7074557 435  0 1
5 6489123 435  2 1
5 6140471 530  6 1
5 6492497 530  3 1
5 6225447 530  6 1
5 6180336 435  6 1
5 6827925 424  0 1
6 7241863 530  0 1
6 5874298 435  2 1
6 6362231 514  3 1
6 6156539 435  0 1
6 6534289 435  0 1
6 5763494 514  3 1
6 6617358 514  0 1
6 6383527 424 24 1
6 6342532 514  2 1
6 6432656 435  0 1
6 6796967 604  0 1
6 6211244 514  0 1
6 6031003 514  1 1
6 7262280 530  1 1
6 7112595 514 10 1
6 6521667 514  0 1
6 5688764 424  0 1
6 5674846 514  0 1
6 6660753 514 18 1
end
Above is an example of my data. I want the average of the cites5yr_tar observations, given that overlap == 1. What I've tried is this (for deal_id == 1):

Code:
egen overlapqual5 = mean(cites5yr_tar) if deal_id == 1 & overlap == 1
This does give me what I want. However, instead of creating a separate egen variable for each deal_id (I have over 200 of them in my full dataset), I wanted to just replace it. I tried

Code:
replace overlapqual5 = mean(cites5yr_tar) if deal_id == 2 & overlap == 1
But I got the "unknown function mean()" error.

Is there any way to work around this in Stata, without creating a separate egen variable for each deal_id?

Thanks,
Chris
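For reference, mean() is an egen function rather than an expression function, which is why replace rejects it. A single egen call with the by() option covers every deal_id at once; a minimal sketch:

Code:
egen overlapqual5 = mean(cites5yr_tar) if overlap == 1, by(deal_id)

* or, to fill the group mean on every row of each deal:
egen overlapqual5_all = mean(cond(overlap == 1, cites5yr_tar, .)), by(deal_id)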