
Double Hurdle Poisson

I am looking to estimate a double hurdle model where the dependent variable is Poisson distributed. I cannot find an existing command to estimate this double hurdle Poisson model, but was wondering if a user-written program exists, or if an existing command can be modified with an option I missed.

From what I understand, there are commands for a double hurdle and for a single-hurdle Poisson, but none for a double hurdle Poisson. Please correct me if I am mistaken, or let me know if these commands can be modified to handle either a second hurdle or a Poisson-distributed dependent variable.
dblhurdle: allows a double hurdle but assumes a normally distributed dependent variable.
churdle: allows a single-hurdle model with a Poisson-distributed dependent variable, but has no option for a second hurdle.
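For reference, a minimal sketch of churdle's single-hurdle syntax (the variable names are placeholders, and the exponential specification shown here is just one of the models churdle offers; see help churdle):
Code:
* single-hurdle model: z1 z2 model the hurdle (selection), x1 x2 the outcome above the lower limit
churdle exponential y x1 x2, select(z1 z2) ll(0)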

Thank you,
Anders

Inclusion of a cluster-level fixed effect in cross-sectional data

Hi All,

I am trying to duplicate the Deaton (1988, 1989, 1990, 1992 and 1997) method of spatial variation, using household-level survey data (NIDS), to estimate price elasticities of demand for alcohol. However, I am struggling to understand which Stata commands would be best. I have explored the aids command, which is only useful for calculating elasticities for more than one product. I was hoping someone could give me advice on the steps detailed below (specifically, how a cluster-level fixed effect can be brought into a cross-sectional study):

Step 1: check for quality effects by regressing the log of unit value on total household expenditure, in the form shown in the first attached image.
Step 2: estimate a "demand" equation in the form shown in the second attached image.
w_hc is the budget share of alcohol in total household expenditure for household h in cluster c, Z_hc is a vector of household characteristics, f_c is a cluster-level fixed effect, and u_hc is the error.
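To illustrate how a cluster-level fixed effect can enter a cross-sectional regression, here is a minimal sketch of Step 2 (all variable names are placeholders, not actual NIDS variables):
Code:
* w_alcohol = budget share of alcohol, lnexp = log total household expenditure,
* z1 z2 = household characteristics, cluster = survey cluster identifier
areg w_alcohol lnexp z1 z2, absorb(cluster)
* equivalently, with explicit cluster dummies:
regress w_alcohol lnexp z1 z2 i.cluster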

Any advice would be greatly appreciated.
Regards
Kelsey

Dynamic Factor Model

I'm trying to build a dynamic factor model, and I came across this presentation, which shows how to estimate a nowcasting model with mixed frequencies.

I'd highly appreciate it if anyone could help me with a number of questions (code and data attached):

1. In this presentation and other papers, it is said that the estimation is done in two steps (get the factor and then regress GDP on the factor).
Does this mean that, in Stata, I should first estimate my model (dfactor), obtain the unobserved factor (predict), and then regress GDP on the factor? (A sketch of this two-step sequence appears after the questions below.)

2. If I have data until July 2018, how can I map this into a GDP forecast of 3Q 2018? Should I regress GDP against moving averages of the factor?

3. What if I have data until July 2018 on all variables and industrial production for August is released: is it possible to update the forecast? how?
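For reference, a minimal sketch of the two-step sequence described in question 1 (the indicator names ip, retail, emp and the outcome gdp_growth are placeholders, the data are assumed to be tsset monthly, and the model specification is purely illustrative; see help dfactor postestimation for the predict options):
Code:
* Step 1: extract one common factor from the monthly indicators
dfactor (ip retail emp = , noconstant) (f = , ar(1))
* Step 2: recover the estimated unobserved factor
predict fhat, factor
* Step 3: regress GDP growth on the factor
regress gdp_growth fhat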

Thanks a lot for your help!

I'm attaching a very basic code and data for the Mexican economy as an example.

Inspecting points on plots

This is a trivial question, but I have not been able to find an answer anywhere (I apologize if this has already been answered elsewhere and I failed to find it):

Is it possible to extract information about a single data point of interest from a graph? E.g. let's say I see an outlier on a simple twoway scatter; is there an easy way to review the x & y coordinates of that one particular point?
As many of you probably know, this is very straightforward in, say, Matlab (you can just hover your cursor over a point and its coordinates appear), but I can't seem to do that in Stata.
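For what it's worth, as far as I know the Graph window does not support hovering over points, but here is a sketch of two common workarounds (y, x, id, and the threshold are placeholders):
Code:
* 1. label each marker with an identifier so outliers can be read off the graph
twoway scatter y x, mlabel(id)
* 2. list the coordinates of points in the suspicious region
list id y x if y > 100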

Thanks in advance.
Jakub

Do I need to include control variables when using a matched sample

I would like to know whether I need to include the variables on which I matched the companies as control variables. In my data, I created a matched dataset based on firm size and industry.
Also, I have seen some studies that did not include any control variables after they created the matched sample. Is that correct?

Thank you

Setting calipers based on logit of propensity score when using teffects psmatch

Hello all,

I am new to teffects psmatch and attempting to set caliper width for matching. I would like to set the caliper width equal to 0.2 of the standard deviation of the logit of the propensity score as has been suggested by prior literature (Austin, 2011) - https://www.ncbi.nlm.nih.gov/pubmed/20925139

In SAS, this is fairly straightforward: in the PSMATCH procedure, setting the CALIPER option to 0.2 specifies that, for a match to be made, the difference in the logits of the propensity scores for a pair of individuals from the two groups must be less than or equal to 0.2 times the pooled estimate of the common standard deviation of the logits of the propensity scores.

However, when using teffects psmatch in Stata, my understanding from the documentation is that setting the caliper() option to 0.2 specifies that, for a match to be made, the difference in the propensity scores for a pair of individuals from the two groups must be less than or equal to 0.2; that is, the caliper is defined on the propensity-score scale rather than on the logit scale.

Thus, I am not sure how to replicate the caliper definition proposed by Austin et al., or any other threshold based on the logit of the propensity score, in Stata, and would appreciate your insight.
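For context, here is a minimal sketch of how the Austin-style caliper value could be computed by hand (treat, x1, and x2 are placeholder names). Note that the resulting number is on the logit scale, whereas teffects psmatch's caliper() is applied to differences in the propensity score itself:
Code:
* estimate the propensity score and take its logit
logit treat x1 x2
predict ps, pr
generate logit_ps = ln(ps/(1 - ps))
* 0.2 times the standard deviation of the logit of the propensity score
* (Austin uses a pooled within-group SD; adjust as needed)
summarize logit_ps
display 0.2*r(sd)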

Thanks,
Tim

Difference between using only dummies and only one categorical variable

Hello,

I am a little bit confused about my model. I am regressing log wages on being an immigrant or not. I have split the immigrant population into different arrival waves. Now, I don't know whether I have to use dummies for each immigrant arrival wave, or whether I can just use the categorical variable whose values are the immigrant arrival waves and natives. Is there a difference between the two following models?

1. Model: Variable arrival has native (=9999) as the reference group
Code:
svy: regress lnhourlyw_w ib9999.arrival if year==2004
(running regress on estimation sample)

Survey: Linear regression

Number of strata =      1                       Number of obs     =     10,726
Number of PSUs   = 10,726                       Population size   =  1,317,293
                                                Design df         =     10,725
                                                F(6, 10720)       =      99.52
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0279

-------------------------------------------------------------------------------
              |             Linearized
  lnhourlyw_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      arrival |
     pre 1980 |  -.1686351   .0151419   -11.14   0.000     -.198316   -.1389543
      1980-84 |  -.1678049   .0202635    -8.28   0.000     -.207525   -.1280847
      1985-89 |  -.2158353   .0165672   -13.03   0.000    -.2483101   -.1833604
      1990-94 |  -.2542113   .0122076   -20.82   0.000    -.2781405   -.2302822
      1995-99 |  -.1508089   .0222109    -6.79   0.000    -.1943463   -.1072715
      2000-04 |  -.0889885   .0228124    -3.90   0.000     -.133705    -.044272
              |
        _cons |   3.689774   .0057737   639.07   0.000     3.678457    3.701092
-------------------------------------------------------------------------------
2. Model: dummies made for each immigrant arrival wave from the variable -arrival-, such that the intercept is the native reference group

Code:
svy: regress lnhourlyw_w i.arvpre1980    i.arv1980 i.arv1985 i.arv1990    i.arv1995    i.arv2000 if year==2004
(running regress on estimation sample)

Survey: Linear regression

Number of strata =      1                       Number of obs     =     10,726
Number of PSUs   = 10,726                       Population size   =  1,317,293
                                                Design df         =     10,725
                                                F(6, 10720)       =      99.52
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0279

-------------------------------------------------------------------------------
              |             Linearized
  lnhourlyw_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
 1.arvpre1980 |   .0406217   .0139277     2.92   0.004     .0133207    .0679226
    1.arv1980 |   .0414519   .0174006     2.38   0.017     .0073435    .0755603
    1.arv1985 |  -.0065785   .0148694    -0.44   0.658    -.0357253    .0225684
    1.arv1990 |  -.0449545   .0120761    -3.72   0.000    -.0686259   -.0212832
    1.arv1995 |   .0584479   .0187726     3.11   0.002     .0216502    .0952457
    1.arv2000 |   .1202683   .0192005     6.26   0.000     .0826317    .1579048
        _cons |   3.480518   .0087417   398.15   0.000     3.463382    3.497653
-------------------------------------------------------------------------------
I see that the coefficients are different, but I don't see why, since the reference group in both models is natives.

Generating a differences in means variable

Greetings,

I'm running Stata 15.1 on OSX and working with longitudinal data. I'd like to measure the gap in ideological distance between Democratic and Republican presidential candidate voters over time. Ideology is scored on a 7-point scale, and party vote is a dummy variable where 1 = voted for the Democratic presidential candidate and 0 = voted for the Republican candidate. I attempted the following:

Code:
egen gopprez_ideo=mean(ideo7) if party_vote==0, by(year)
Code:
egen demprez_ideo=mean(ideo7) if party_vote==1, by(year)
The resulting variables look like this:
[attached screenshot of the resulting variables]

I next attempted to subtract demprez_ideo from gopprez_ideo to create a new variable, with the following syntax:

Code:
gen prezelect_ideodiff= gopprez_ideo-demprez_ideo
When I tabulated the resulting variable, however, 'no observations' is returned. What am I doing wrong? Thanks in advance!

Sample data:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float year byte(party_vote ideo7)
1972 0 5
1972 0 4
1972 1 5
1972 0 6
1972 0 4
1972 0 5
1972 1 1
1972 0 7
1972 0 5
1972 0 6
1972 0 4
1972 1 2
1972 1 2
1972 1 2
1972 0 6
1972 1 7
1972 1 2
1972 0 4
1972 0 6
1972 0 5
1972 0 6
1972 0 4
1972 0 6
1972 1 2
1972 0 4
1972 1 4
1972 0 4
1972 1 2
1972 1 5
1972 0 6
1972 0 4
1972 1 3
1972 1 2
1972 0 4
1972 0 6
1972 0 6
1972 1 3
1972 0 2
1972 1 3
1972 0 5
1972 0 2
1972 1 2
1972 0 4
1972 1 4
1972 0 4
1972 0 4
1972 0 5
1972 1 5
1972 0 6
1972 0 5
1972 0 5
1972 1 2
1972 0 5
1972 1 4
1972 0 5
1972 0 1
1972 0 4
1972 1 2
1972 0 5
1972 0 3
1972 1 4
1972 0 5
1972 0 6
1972 1 3
1972 0 5
1972 0 4
1972 0 6
1972 1 3
1972 0 7
1972 0 6
1972 0 5
1972 0 6
1972 1 3
1972 0 4
1972 1 3
1972 0 5
1972 0 5
1972 0 4
1972 0 5
1972 0 3
1972 0 5
1972 0 4
1972 0 3
1972 0 5
1972 1 3
1972 0 6
1972 1 2
1972 0 7
1972 1 3
1972 0 3
1972 0 4
1972 0 4
1972 0 6
1972 0 6
1972 0 5
1972 0 4
1972 0 4
1972 0 7
1972 1 4
1972 0 4
end
label values party_vote VCF0704a
label def VCF0704a 0 "0. Did not vote; DK/NA if voted; refused to say if", modify
label def VCF0704a 1 "1. Democrat", modify
label values ideo7 VCF0803_
label def VCF0803_ 1 "1. Extremely liberal", modify
label def VCF0803_ 2 "2. Liberal", modify
label def VCF0803_ 3 "3. Slightly liberal", modify
label def VCF0803_ 4 "4. Moderate, middle of the road", modify
label def VCF0803_ 5 "5. Slightly conservative", modify
label def VCF0803_ 6 "6. Conservative", modify
label def VCF0803_ 7 "7. Extremely conservative", modify
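For illustration, a minimal sketch of one way to obtain both group means on every observation, so that their difference is defined (this is only a sketch of the approach; the variable names ending in 2 are new):
Code:
* the two egen calls above leave each mean missing on the other party's rows,
* so the difference is missing everywhere; computing both means on all rows avoids this
egen gopprez_ideo2 = mean(cond(party_vote==0, ideo7, .)), by(year)
egen demprez_ideo2 = mean(cond(party_vote==1, ideo7, .)), by(year)
generate prezelect_ideodiff2 = gopprez_ideo2 - demprez_ideo2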

Duplicating observations for particular years based on a start and end year

Dear Statalist,

I have a dataset that follows the values of coefficients for a large number of cities over time. Each city used a different coefficient during different periods. The structure of the data is as follows:
City ID   Start year   End year   Coefficient
1         2000         2001       2
1         2002         2010       3
2         2000         2003       4
2         2004         2009       5
2         2010         2010       4
However, I need to have a separate observation for each town and year to create a panel like this:
City ID   Year   Coefficient
1         2000   2
1         2001   2
1         2002   3
...       ...    ...
2         2000   4
2         2001   4
2         2002   4
...       ...    ...
Do you have any advice on how to do this?
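For illustration, a sketch of one standard approach using -expand- (the variable names city_id, start_year, end_year, and coefficient are assumed; adjust to the actual names):
Code:
generate nyears = end_year - start_year + 1
expand nyears
bysort city_id start_year: generate year = start_year + _n - 1
drop nyears start_year end_year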

Thank you,

Paulina

Problem with bysort and tabulate using asdoc

Hi all,

I wonder if it is possible to use asdoc (from SSC) with bysort: and tabulate, for example:

Code:
sysuse auto.dta
bys foreign: asdoc tab rep78
The result is not what I expected:

Code:
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
Click to Open File:  Myfile.doc
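For what it's worth, a sketch of one possible workaround: run asdoc once per group with an if qualifier and append the tables to the same file (save(), replace, and append are asdoc options; see help asdoc):
Code:
sysuse auto, clear
asdoc tab rep78 if foreign == 0, save(Myfile.doc) replace
asdoc tab rep78 if foreign == 1, save(Myfile.doc) append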
I am grateful for any comments.
Thanks

Graphical display of Max Youden index

Dear Experts,
I need your help with code to generate a graph showing the sensitivity, specificity, and Youden index at different ratios, similar to the one posted here.
How would it be possible to have the Youden index on the Y axis and the ratio on the X axis, with a line marking the point where the Youden index is highest?
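For illustration, a sketch assuming the data contain one row per ratio, with variables named ratio, sens, and spec (all placeholder names):
Code:
generate youden = sens + spec - 1
* find the ratio at which the Youden index is highest
gsort -youden
local best_ratio = ratio[1]
sort ratio
twoway (line sens ratio) (line spec ratio) (line youden ratio), ///
    xline(`best_ratio') xtitle("Ratio") ytitle("Sensitivity / Specificity / Youden index")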
I much appreciate your help
Sincerely
[attached image: example graph]

Comparing each string observation against a list of string observations

Dear all,

Suppose I have the following dataset.
Code:
clear
input str8 groupid str8 memberid str8 related_memberid
"A000" "B000" "B009" 
"A000" "B000" "B005"
"A000" "B000" "B006"
"A000" "B000" "B010"
"A000" "B002" "B010"
"A000" "B002" "B001"
"A000" "B002" "B023"
"A000" "B003" "B004"
"A000" "B003" "B016"
"A000" "B004" "B003"
"A000" "B004" "B015"
"A000" "B005" "B000"
"A000" "B005" "B015"
"A000" "B006" "B000"
"A000" "B006" "B016"
"A001" "B015" "B005"
"A001" "B018" "B116"
"A001" "B019" "B025"
"A001" "B017" "B042"
"A001" "B019" "B053"
"A002" "B031" "B045"
"A002" "B032" "B062"
"A002" "B033" "B086"
end
I'd like to compare each string observation of related_memberid against all string observations of memberid within the same group (identified by groupid) to see if they are matched.

For example, within group A000, I'd like to see if B009 in related_memberid matches with any observations in memberid (e.g. B000, B002, B003, B004, B005, B006).

Could anyone please help me with this?
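For illustration, a sketch of one possible approach: build the list of member IDs per group and check each related_memberid against it with a merge on (groupid, related_memberid). The flag name is_match is a placeholder:
Code:
preserve
keep groupid memberid
duplicates drop
rename memberid related_memberid
tempfile members
save `members'
restore
merge m:1 groupid related_memberid using `members', keep(master match)
generate byte is_match = (_merge == 3)
drop _merge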

Thank you very much,

Vinh

How to store estimates from a regression in local macros

Hi Stata users,

Please can anyone point me in the direction of the commands needed to store estimates from a regression in local macros, using the sample data below:

Code:
clear
set seed 100
set obs 100
gen y = 1 in 1/20
replace y = 0 in 21/100
gen x = runiform()
glm y x, fam(bin) link(log)
I know the below is not the right code, but I want to achieve something like:

local estimate = coefficient
local lci = lower 95%CI
local uci = upper 95%CI
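For reference, a minimal sketch of one way to do this after the glm fit above, for the coefficient on x (glm reports a normal-based confidence interval, so the bounds are rebuilt with invnormal()):
Code:
local estimate = _b[x]
local lci = _b[x] - invnormal(0.975)*_se[x]
local uci = _b[x] + invnormal(0.975)*_se[x]
display "coef = `estimate'   95% CI = [`lci', `uci']"
An alternative is to read the bounds directly from the matrix returned in r(table) after estimation.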

Thanks
Madu

Graph Slope of a Quadratic Regression

Hi everyone,

Just curious to know if there is a command in Stata that will plot the slope of the fitted line below.

Thanks.
[attached image: scatter plot of the fitted values (yhat) against hhsize]


These are the commands I used to get to the figure above:

gen hhsize2 = hhsize * hhsize
reg pscore hhsize hhsize2
predict yhat
twoway (scatter yhat hhsize)
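For what it's worth, a sketch of one way to plot the slope itself (the derivative of the quadratic fit with respect to hhsize), using factor-variable notation so that margins handles the squared term; the at() range is a placeholder and should cover the observed household sizes:
Code:
regress pscore c.hhsize##c.hhsize
margins, dydx(hhsize) at(hhsize = (1(1)15))
marginsplot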

Help needed on my idea!!!

Hello, I'm trying to figure out whether I can use a difference-in-differences estimator to estimate the effect of the legalization of marijuana on the crime rate, comparing Colorado with Wisconsin.
Would it be a viable estimator for this?

Please Help - Differentiating Graph Marker Size and Legend Marker Size

Hi Stata community,

I am trying to create a scatter plot in Stata 15.1 (on Linux, though I can work on Windows too) in which I am trying to make the legend marker size different from the marker size in the plot region (see attached).

For the plot command, I have:
(scatter price date if product == "XXX", mcolor(red%20) msymbol(triangle) msize(vsmall))
(scatter price date if product == "YYY", mcolor(blue%20) msymbol(circle) msize(vsmall))

And then for the legend option, I have:
legend(label(1 "XXX") label(2 "YYY") size(vsmall) region(col(white) lstyle(none)) row(1))

It seems that when I change the size() in the legend option, it also changes the size of the text in the legend. However, I only want to scale up the marker size in the legend (i.e. the red triangle and the blue circle right next to the legend text), without doing the same for the markers in the plot region.

I’d appreciate if anyone could kindly offer any insights on how I can resolve the issue. Thank you!

Event study examining selection bias(?) in DD specification

I have a school district-year panel and am running a diff-in-diff (DD) looking at a policy change (a switch to four-day school weeks) that happens at the school district-level ("fourday" is my treatment dummy). Districts switch schedules at different years and always stay on that schedule after switching. I use year and district fixed effects in a DD specification (with standard errors clustered at the district level) to estimate the effect of four-day weeks on achievement outcomes (avg).

The code I've used for that specification is the following:
Code:
reghdfe avg fourday ${covariates}, absorb(year district_id) vce(cluster district_id)
Now, I am trying to perform robustness checks and have been struggling with what to do. I first perform a parallel trends event study specification for achievement outcomes using the following code:
Code:
reghdfe avg fourday_lead_3 fourday_lead_2 fourday_lead_1 fourday_time_0 fourday_lag_1 ///
              fourday_lag_2plus ${covariates}, absorb(year district_id) vce(cluster district_id)
test fourday_lead_3=fourday_lead_2=fourday_lead_1
test fourday_time_0=fourday_lag_1=fourday_lag_2plus
I interpret the point estimates as the "effect" of being x years pre- ("lead") or post- ("lag") four-day treatment (fourday_time_0 = adoption year), relative to never having been treated or being four or more years before receiving treatment.
Is it correct to use an F-test that equates the pre-treatment years to each other, but not to zero, because I want to allow for different levels in treatment without allowing for different trends in treatment?

Here is where I am most stuck: I want to check for selection into treatment (or other general threats to the original DD?) based on demographic variables such as the percent of students receiving free lunch (perfrl). I think it would invalidate the original DD if there were differential trends in perfrl leading up to treatment between the treatment and control districts. I think it would also be problematic if perfrl were changing after treatment, as that could indicate that the districts exposed to four-day weeks are demographically different from the control districts (which I do not expect to be the case).
Could I use a similar event study specification as above to examine this question? Would a single significant point estimate be problematic, or only a significant F-test? Would the F-test in this case be that all point estimates are equal to each other? Or should I test the equivalence of pre-treatment and post-treatment point estimates separately, to allow for different trends pre- and post-treatment? Or should I do something else entirely?
Code:
reghdfe perfrl fourday_lead_3 fourday_lead_2 fourday_lead_1 fourday_time_0 fourday_lag_1 ///
           fourday_lag_2plus ${covariates}, absorb(year district_id) vce(cluster district_id)
test fourday_lead_3=fourday_lead_2=fourday_lead_1=fourday_time_0=fourday_lag_1=fourday_lag_2plus
Any advice would be greatly appreciated -- thanks!

Inquiry about "predict"

The dataset is mus03data.dta from the website
HTML Code:
http://cameron.econ.ucdavis.edu/musbook/mus.html
Code:
quietly regress ltotexp suppins phylim actlim totchr age female income, vce(robust)
rvfplot
predict uhat, residual
What is the effect of the residuals option here? Also, why are 109 missing values generated, while
Code:
predict  yhat, xb
doesn't generate missing values?
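For illustration, a quick check that can be run after the commands above: residuals require the observed outcome ltotexp, whereas xb needs only the regressors, so the missing residuals should line up with missing values of ltotexp.
Code:
count if missing(ltotexp)    // if this is 109, it matches the missing residuals
count if missing(uhat)
count if missing(yhat)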

Many thanks in advance!

Multiple imputation with no complete cases and too many variables

Dear all,

Even though I have tried to find a solution to my problem, I still face the following multiple-imputation challenge. Also, my questions below are not directly related to the use of Stata but to statistics. However, I think this is a problem that some Stata users may face as well.

The objective of my project is to predict Y using a set of variables X. I don't care about the estimates but the prediction only.

Characteristics of my data:
  1. My data is a panel of 217 individuals followed over 58 years.
  2. There are about 1,500 X variables.
  3. All the X variables have missing values; I don't have a single complete case.
  4. Y also has missing values.
  5. My Y variable has a two-year lag with respect to the X variables. That is, some X variables go up to 2017 whereas my Y variable goes up to 2015. I am interested in predicting (nowcasting) Y for 2016 and 2017; I don't care about accurately predicting missing values of Y for years before 2015.
  6. My X variables are on different scales. Some are continuous, some are shares or proportions of something else, some are densities, and some are changes over time. None of them is categorical.
  7. I don't know of any qualitative relationship among the X variables, besides the fact that I can group them into different topics like "exercise habits", "sleep habits", and "nutritional habits", among many others.
Given these characteristics, I am trying to find the answer to the following questions, but I have not found anything substantial yet, and I was wondering whether you could shed some light on this.
  1. It could be argued that all the X variables relate to each other, but that would imply doing multiple imputation of 1,500 variables at the same time. Is it reasonable to do so, or is it better to impute by topic?
  2. Given that I don't know any theoretical relationship between the X variables, is there any statistical analysis that sheds some light on which variables I should impute together?
  3. Given that I have so many variables, does it make sense to do multiple imputation in parts rather than all at once? If so, which variables should I impute first? The ones that have fewer missing values?
I know that it is my job to do proper research and select the correct imputation model for my project. So far, I have not found anything that deals with the two main characteristics of my dataset: [1] no complete cases and [2] a very large number of variables. If you could point me to any paper or relevant document, I would highly appreciate it.


Thank you so much,

Pablo.



Reshape to wide

I am trying to reshape the following table to wide format, but I keep getting errors because, I think, my data structure is not what -reshape- is designed to work with:

Firm    Year   estimator
3D      2000   BEAR
3D      2000   GSAX
3D      2001   GSAX
Canon   2000   BEAR
Canon   2000   JPMORGAN
Canon   2000   GSAX
Now, I want this to be in this format:
Firm    Year   estimator1   estimator2   estimator3
3D      2000   BEAR         GSAX
3D      2001   GSAX
Canon   2000   BEAR         JPMORGAN     GSAX

Could you please help me with the code?
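For illustration, a sketch of one standard approach: create a within-(Firm, Year) counter and use it as the j variable for reshape wide (variable names follow the tables above):
Code:
bysort Firm Year: generate j = _n
reshape wide estimator, i(Firm Year) j(j)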

Thanks,
Navid




