Channel: Statalist

cmp for instrumental multinomial probit when there are many categories

Dependent variable Y1 has three categories 1, 2, 3

Dependent variable Y2 has four categories 1, 2, 3, 4

Y2 is simply created by splitting category "3" of Y1 into categories 3 and 4.

The code with Y1

Code:
cmp (Y1=X C, iia) (X=Z C) if C<100, ind($cmp_mprobit $cmp_cont) vce(cluster D)
ran quickly without error, whereas the following code

Code:
cmp (Y2=X C, iia) (X=Z C) if C<100, ind($cmp_mprobit $cmp_cont) vce(cluster D)
produces the following output:

Fitting full model.
Likelihoods for 522614 observations involve cumulative normal distributions above dimension 2.
Using ghk2() to simulate them. Settings:
Sequence type = halton
Number of draws per observation = 1446
Include antithetic draws = no
Scramble = no
Prime bases = 2 3
Each observation gets different draws, so changing the order of observations in the data set would change the results.
and it is taking forever to run.

I have a few questions.

First, when it says "Likelihoods for 522614 observations involve cumulative normal distributions above dimension 2", is it saying "above dimension 2" because Y2 has 4 categories? So if I don't want to spend this much time on a single regression, should I stick to at most 3 categories, as in Y1?

Second, it says "Each observation gets different draws, so changing the order of observations in the data set would change the results." Given that, is it still safe to trust the regression results?

Third, I read this from help cmp.

"If the estimation problem requires the GHK algorithm (see above), change the number of draws per observation in the simulation sequence using the ghkdraws() option. By default, cmp uses twice the square root of the number of observations for which the GHK algorithm is needed, i.e., the number of observations that are censored in at least three equations. Raising simulation accuracy by increasing the number of draws is sometimes necessary for convergence and can even speed it by improving search precision. On the other hand, especially when the number of observations is high, convergence can be achieved, at some loss in precision with remarkably few draws per observations--as few as 5 when the sample size is 10,000 (Cappellari and Jenkins 2003). And taking more draws can also greatly extend execution time."
How can I reconcile

Sentence 1 "increasing the number of draws is sometimes necessary for convergence and can even speed it by improving search precision."
and

Sentence 2 "On the other hand, especially when the number of observations is high, convergence can be achieved, at some loss in precision, with remarkably few draws per observations (...) And taking more draws can also greatly extend execution time."
?

Sentence 1 suggests increasing # in ghkdraws(#) to speed things up, while Sentence 2 suggests decreasing it. Can I reconcile the two as "when N is large, choose a low #; when N is small, choose a high #"?

Also, if, as the guideline says, 5 draws are enough for 10,000 observations, will ghkdraws(5) also be enough for my 522,614 observations? If so, why does cmp use 1446 draws per observation by default?
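For concreteness, the run I am thinking of trying next looks like this (untested; 50 is just an arbitrary illustrative value for the draw count, not something taken from the help file):

Code:
cmp (Y2=X C, iia) (X=Z C) if C<100, ind($cmp_mprobit $cmp_cont) vce(cluster D) ghkdraws(50)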

Unclear Seasonality &amp; Strange Structural Break Test

I'm getting odd results with time-series data and I'm trying to interpret them to find seasonality. Using latest Stata, Win 10.

I'm working with a univariate monthly time series: 17 years of count data on the number of prisoners in a state prison system. There is a positive upward trend for 15 years, and then the series decreases rapidly. I've attached a plot below.

My problem is this: I've concluded that using a differenced correlogram in tandem with a differenced periodogram is a good way to find seasonal highs in prisoner counts in my data. I've attached both below. The correlogram seems to indicate seasonality at 1 year, 1.5 years, 2 years, and further harmonics of this 6-month cycle (2.5 years, 3 years, etc.). The periodogram (I've added xlines) also seems to indicate this cycle, at 1/0.83, 1/.166, 1/.25, 1/.33, etc., but it also has strange dips at .22, .31, and .48.
Syntax used: -pergram D.docaverage- and -ac D.docaverage-

Now, I'm fairly new to these analyses, so maybe this isn't an issue, but I'm not sure how to interpret those dips in the periodogram. What do they mean? Overall these two plots seem to indicate seasonality, but it isn't as clear as I would like.

Secondary question: when I run -estat sbknown- and pick a date (any date) in my time series, it invariably finds a structural break (a significant result), even at points where -estat sbsingle- does not. I am running these commands on non-differenced data. Do they use different tests? What could explain this result?
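For reference, the comparison I am running looks roughly like this (untested sketch; monthdate stands in for my monthly time variable and the break date is just an example):

Code:
tsset monthdate
regress docaverage                  // constant-only model; add regressors as appropriate
estat sbknown, break(tm(2015m1))    // Wald test at a prespecified break date
estat sbsingle                      // sup-Wald test for a single unknown break date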






[Attachments: prisoner-count time series plot, differenced correlogram, and differenced periodogram]

How to calculate/obtain the standard errors of the residuals using FamaMacBeth with asreg fmb for Shanken Correction (1992)

Hello everyone,

I am currently trying to implement the Shanken (1992) correction for my dataset after running asreg, fmb.

I somehow fail to obtain the standard errors of the residuals. Is there any command or way to obtain these?

My dataset is structured the following way (using random numbers, simplified):
Key ID u CAP CPI GDP
1001 1 -.1154 .0025 .0084 -.0032
1002 2 -.1154 .0025 .0084 -.0032
.. .. .. .. .. ..
1267012 12 .0232 -.302 .0122 .0032
ID is basically the number of the month of the given year.

I am running this loop to estimate the returns of the next 12 months (column u) using the identical betas for CAP, CPI & GDP I predicted in my stage 1 regression for each month of the given year.

Code:
scalar COUNTER  = 12
scalar COUNTER2 = 0

// Using a counter from 1 to 8, since I am performing the regression for a period of 8 years;
// each pass keeps the 12 months of the current year (COUNTER2 < ID <= COUNTER)
forval i = 1/8 {

    use "C:\Stata_Data\Stage2.dta", clear

    drop if ID >  COUNTER
    drop if ID <= COUNTER2

    asreg u CAP CPI GDP, fmb
    matrix b = e(b)
    scalar A = el(b,1,1)    // first three elements of e(b): the three slope coefficients
    scalar B = el(b,1,2)
    scalar C = el(b,1,3)

    use "C:\Stata_Data\FMB.dta", clear
    replace b_b_REX = A if ID == `i'
    replace b_b_CPI = B if ID == `i'
    replace b_b_TS  = C if ID == `i'
    save "C:\Stata_Data\FMB.dta", replace

    scalar COUNTER  = COUNTER  + 12
    scalar COUNTER2 = COUNTER2 + 12
}
As a next step, as already mentioned, I would like to include the Shanken correction. Since I haven't found any code or tool that can help me calculate the Shanken correction, I am currently working on my own way to make it work. For that, I need to obtain the standard errors of the residuals. Can anyone help me out here?
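The direction I am exploring is the untested sketch below. Since I am not sure that asreg, fmb exposes the residuals directly, it simply reruns the cross-sectional regressions month by month with plain -regress- on the stage-2 data and stores each month's residuals and their standard deviation (the new variable names are just placeholders):

Code:
gen double resid_u  = .
gen double resid_sd = .
levelsof ID, local(months)
foreach m of local months {
    quietly regress u CAP CPI GDP if ID == `m'
    tempvar r
    quietly predict double `r' if e(sample), residuals
    quietly replace resid_u  = `r'   if ID == `m'
    quietly summarize `r'
    quietly replace resid_sd = r(sd) if ID == `m'
    drop `r'
}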

If anyone has a working solution for the Shanken Correction, I would highly appreciate some further help.

Best regards and thank you in advance,

Tobias


Time Fixed effects

Hello everyone,
I am running a panel data regression, and the Hausman test indicates that fixed effects is the appropriate estimator. The results I obtain are significant and seem to make sense given the variables. However, when I then use testparm to check whether I need time fixed effects, I get Prob > F = 0.0011, which as I understand it means I should include i.Year in my regression. When I do so, though, my fixed-effects results change massively and no longer seem to make sense (coefficients that were negative before now become positive, etc.).
I have observations for the years 2010, 2012, 2014, and 2016.
I am not sure how to proceed or how to explain this.
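For reference, the sequence I am running is roughly the following (y and the x's are placeholders for my actual variables):

Code:
xtreg y x1 x2, fe
xtreg y x1 x2 i.Year, fe
testparm i.Year        // joint significance of the year dummies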
Many thanks

Problem with delimit

Dear All,

I have a master file from where I invoke other auxiliary files, the code being:

Code:

* Definition of the directory
cd "MyDir"


* Set the log file and loading the data
cap log close
log using Log\21Mar19.smcl, replace
set more off


/* Running preliminary analysis and tables */

do 4-Estimations_Tables.do
Inside the do-file 4-Estimations_Tables.do I have routines for the estimations and for generating the tables. Since generating the tables requires a lot of code, I use #delimit. Specifically:

Code:

#delimit ;

esttab   slm_dW_pc_culture slm_dW_pc_culturepos slm_dW_pc_children
   slm_dW_control slm_dW_trust slm_dW_obedience slm_dW_respect
   using "Tables_dW\slm_dW.tex",
   replace star(* 0.10 ** 0.05 *** 0.01)
   plain nogaps depvars b(%9.3f)
   legend noabbrev style(tex) booktabs
   title("slm_dW" \label{slm_dW}) se
   substitute(\begin{table}[htbp]\centering
   \begin{table}[htbp]\centering\footnotesize{ \end{tabular} \end{tabular}}) ;
   
* Table sem_dW

esttab   sem_dW_pc_culture sem_dW_pc_culturepos sem_dW_pc_children
   sem_dW_control sem_dW_trust sem_dW_obedience sem_dW_respect
   using "Tables_dW\sem_dW.tex",
   replace star(* 0.10 ** 0.05 *** 0.01)
   plain nogaps depvars b(%9.3f)
   legend noabbrev style(tex) booktabs
   title("sem_dW" \label{sem_dW}) se
   substitute(\begin{table}[htbp]\centering
   \begin{table}[htbp]\centering\footnotesize{ \end{tabular} \end{tabular}}) ;
   
* Table sarar_dW

esttab   sarar_dW_pc_culture sarar_dW_pc_culturepos sarar_dW_pc_children
   sarar_dW_control sarar_dW_trust sarar_dW_obedience sarar_dW_respect
   using "Tables_dW\sarar_dW.tex",
   replace star(* 0.10 ** 0.05 *** 0.01)
   plain nogaps depvars b(%9.3f)
   legend noabbrev style(tex) booktabs
   title("sarar_dW" \label{sarar_dW}) se
   substitute(\begin{table}[htbp]\centering
   \begin{table}[htbp]\centering\footnotesize{ \end{tabular} \end{tabular}});

#delimit cr


In the output window I can see the estimations. But when I go to the folder where the tables are supposed to be, only the first table has been created. I also use xml_tab to generate files readable in Excel; in that case not even the first table is created. What is the issue here?
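For what it is worth, I could also rewrite each call with /// line continuations instead of #delimit, along the lines of the untested sketch below (first table only, with the substitute() part omitted):

Code:
esttab slm_dW_pc_culture slm_dW_pc_culturepos slm_dW_pc_children     ///
    slm_dW_control slm_dW_trust slm_dW_obedience slm_dW_respect      ///
    using "Tables_dW\slm_dW.tex", replace                            ///
    star(* 0.10 ** 0.05 *** 0.01) plain nogaps depvars b(%9.3f)      ///
    legend noabbrev style(tex) booktabs                              ///
    title("slm_dW" \label{slm_dW}) se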

Thanks in advance

Education Calculation!

Good Afternoon Everyone,

I wonder if someone can help me find the right command.

I need to calculate a network connection where two directors went to the same university and graduated within 2 years of each other.
Is there any way I can calculate this? Below is a sample of my data.

I am using the following commands, but I am sure they are not really calculating what I need:

sort directorid InstitutionName QualificationDate
egen c4=group(directorid InstitutionName QualificationDate Qualificationtype )


input double directorid str43 institutionname str4 qualificationdate str27 qualification
11110932958 "Adelaide University Union (AUU)" "1967" "Bachelor's Degree (Hons)"
11269183344 "Adelaide University Union (AUU)" "1968" "BSc (Hons)"
11111062958 "Adelaide University Union (AUU)" "1970" "BTech"
642136497 "Adelphi University" "1967" "MBA"
329278949 "Adelphi University" "1968" "MBA"
13611206873 "Administrative Staff College of India (ASC)" "2005" "Attended"
11764174309 "Administrative Staff College of India (ASC)" "2005" "Attended"
81520910783 "Administrative Staff College of India (ASC)" "2006" "Advanced Management Program"
13805177095 "Administrative Staff College of India (ASC)" "2008" "Training Program"
end

It would be great if someone could help me find the right command for this calculation.
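Here is an untested sketch of what I have been trying to put together, using -joinby-; the names gradyear, directorid2, and connected are placeholders I made up:

Code:
* pair every director with every other director from the same institution,
* then flag pairs who graduated within 2 years of each other
destring qualificationdate, gen(gradyear)

preserve
keep directorid institutionname gradyear
rename (directorid gradyear) (directorid2 gradyear2)
tempfile other
save `other'
restore

joinby institutionname using `other'
drop if directorid == directorid2                   // drop self-pairs
gen byte connected = abs(gradyear - gradyear2) <= 2
* note: each pair appears twice (A-B and B-A); keep directorid < directorid2 to count each pair once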

Kind Regards
Jas

How to interpret non-standardised coefficients?

Hi all,

I am looking at the effect of gender inequality (measured by the GII, a 0-1 scale where higher values indicate greater inequality) on economic growth (measured by GDP per capita growth, in %). Since the two variables are not measured in the same units, I was wondering how to interpret the effect of GII on growth from my coefficients, and also how to interpret the interaction term between GII and income (measured by the natural log of GDP per capita), given that I have interacted two variables measured on different scales.

Code:
xtreg Growth lagGII lagIncomeln GII_Income i.Year, fe robust

Fixed-effects (within) regression               Number of obs     =      2,276
Group variable: CountryID                       Number of groups  =        114

R-sq:                                           Obs per group:
     within  = 0.1394                                         min =         18
     between = 0.0410                                         avg =       20.0
     overall = 0.0255                                         max =         20

                                                F(22,113)         =      14.49
corr(u_i, Xb)  = -0.9338                        Prob > F          =     0.0000

                            (Std. Err. adjusted for 114 clusters in CountryID)
------------------------------------------------------------------------------
             |               Robust
      Growth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lagGII |  -50.08883   12.19639    -4.11   0.000    -74.25208   -25.92559
 lagIncomeln |  -7.186864   1.354488    -5.31   0.000     -9.87035   -4.503379
  GII_Income |   5.313213   1.206568     4.40   0.000     2.922784    7.703641
             |
        Year |
       1997  |  -.3565464   .3871003    -0.92   0.359    -1.123462    .4103691
       1998  |  -.9939671   .4270054    -2.33   0.022    -1.839942   -.1479923
       1999  |   -.847875   .4163987    -2.04   0.044    -1.672836    -.022914
       2000  |  -.0046979   .4400977    -0.01   0.992    -.8766109    .8672151
       2001  |  -.9246834   .4146138    -2.23   0.028    -1.746108   -.1032587
       2002  |     -.5211   .4825815    -1.08   0.283    -1.477181    .4349809
       2003  |   .2614466   .4714257     0.55   0.580    -.6725327    1.195426
       2004  |   1.514311     .45475     3.33   0.001     .6133698    2.415253
       2005  |   1.131297   .4277166     2.64   0.009      .283913     1.97868
       2006  |   2.066314   .4628423     4.46   0.000      1.14934    2.983288
       2007  |   2.169376   .4984575     4.35   0.000     1.181842    3.156911
       2008  |   .4962147   .5488308     0.90   0.368     -.591118    1.583548
       2009  |  -2.813979   .4936407    -5.70   0.000     -3.79197   -1.835987
       2010  |   1.726426   .4774053     3.62   0.000     .7805997    2.672252
       2011  |   1.158964   .6006279     1.93   0.056    -.0309881    2.348916
       2012  |   .9286711   .5992796     1.55   0.124    -.2586098    2.115952
       2013  |   .7256153   .6654279     1.09   0.278    -.5927175    2.043948
       2014  |   1.278339   .5844081     2.19   0.031     .1205209    2.436156
       2015  |   .8467109   .6811747     1.24   0.216     -.502819    2.196241
             |
       _cons |   69.04605    12.6388     5.46   0.000     44.00631    94.08578
-------------+----------------------------------------------------------------
     sigma_u |  5.5939725
     sigma_e |  3.3023023
         rho |  .74156902   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Is there a way I can work this out, or is it easier to standardise my coefficients? If so, how do I do this?
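For reference, what I am considering trying is the untested sketch below. It assumes GII_Income is simply lagGII × lagIncomeln, so the model can be refit with factor-variable notation and -margins- used to get the marginal effect of lagged GII at different income levels (the income values in at() are just examples):

Code:
xtreg Growth c.lagGII##c.lagIncomeln i.Year, fe robust
margins, dydx(lagGII) at(lagIncomeln = (7 8 9 10))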

Many thanks,

Hellie

Should we winsorize the interaction variables ?

I have two variables, a and b:

a is a continuous variable;
b is a dummy (0/1);
c = a*b is their interaction.

I winsor2 the variables a and b, but I don't winsor2 the variable c.
Is that OK?
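For reference, the alternative I have been wondering about is the untested sketch below: winsorize a first and then rebuild the interaction from the winsorized variable, since the dummy b has nothing to winsorize (the _w names are just my own convention):

Code:
winsor2 a, cuts(1 99) suffix(_w)   // winsor2 is from SSC; creates a_w
gen c_w = a_w * b                  // b is a 0/1 dummy, so it is left as is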

Whether to include squared terms with OLS or not

Hi all,

I am running a regression on logged hourly pay and was wondering how to test whether I should include higher-order terms of my continuous variables. The continuous variables I have are both measured in years: experience and tenure. I have noticed it is common to include age and age squared to account for decreasing marginal returns, and was wondering if there is a way to check whether to include tenure and/or experience as squared terms too. I have plotted two-way scatter graphs of log hourly pay against tenure and against experience but cannot see a clear relationship.
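For reference, the kind of check I have in mind is the untested sketch below (lnpay, exper, and tenure are placeholders for my actual variable names):

Code:
regress lnpay c.exper##c.exper c.tenure##c.tenure
testparm c.exper#c.exper c.tenure#c.tenure    // joint test of the squared terms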

Thank you for your time, I really appreciate it!

Advice on survival analysis setup

Hello everyone,

I’m hoping to solicit some advice from those of you who are familiar with survival analysis. I’m still a bit new to the subject, but would like to get my feet wet with some firm-level data. I’ve read through a couple of helpful resources, but am still a bit puzzled as to how to properly stset this particular setup. The closest example I can find related to my setup on Statalist is this thread (https://www.statalist.org/forums/for...re-using-stset).

Below is a sample of my data. It’s confidential, so it’s been heavily altered. The data are firm-level data from Q1 2008 to Q3 2014. I have a firm id (firm), a quarterly variable (quarter), the employment count (employment), the date the firm came into business (register_date), the date the firm went out of business (termination_date), and the major industry (naics). I’ve removed the other covariates for simplicity.

Here’s what I’ve done – I’m not sure if this is right, but would appreciate any feedback!

Code:
gen failure = 0
replace failure = 1 if quarter == termination_date
 
bys firm (quarter) : replace failure = . if  failure[_n-1] == 1
bys firm (quarter) : replace failure = . if  failure[_n-1] == . & quarter >= termination_date
bys firm (quarter) : replace failure = . if  employment    == . & quarter < termination_date
 
 
stset quarter, failure(failure == 1)
replace _t = _t - 191
Here’s my data:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(firm quarter) float employment str6 naics int(register_date termination_date)
1 192   . "54" 204 207
1 193   . "54" 204 207
1 194   . "54" 204 207
1 195   . "54" 204 207
1 196   . "54" 204 207
1 197   . "54" 204 207
1 198   . "54" 204 207
1 199   . "54" 204 207
1 200   . "54" 204 207
1 201   . "54" 204 207
1 202   . "54" 204 207
1 203   . "54" 204 207
1 204  38 "54" 204 207
1 205  55 "54" 204 207
1 206   6 "54" 204 207
1 207  66 "54" 204 207
1 208   . "54" 204 207
1 209   . "54" 204 207
1 210   . "54" 204 207
1 211   . "54" 204 207
1 212   . "54" 204 207
1 213   . "54" 204 207
1 214   . "54" 204 207
1 215   . "54" 204 207
1 216   . "54" 204 207
1 217   . "54" 204 207
1 218   . "54" 204 207
2 192   . "54" 204   .
2 193   . "54" 204   .
2 194   . "54" 204   .
2 195   . "54" 204   .
2 196   . "54" 204   .
2 197   . "54" 204   .
2 198   . "54" 204   .
2 199   . "54" 204   .
2 200   . "54" 204   .
2 201   . "54" 204   .
2 202   . "54" 204   .
2 203   . "54" 204   .
2 204  27 "54" 204   .
2 205  39 "54" 204   .
2 206  21 "54" 204   .
2 207  20 "54" 204   .
2 208  66 "54" 204   .
2 209   5 "54" 204   .
2 210   2 "54" 204   .
2 211  29 "54" 204   .
2 212  26 "54" 204   .
2 213  24 "54" 204   .
2 214   6 "54" 204   .
2 215  10 "54" 204   .
2 216  22 "54" 204   .
2 217   5 "54" 204   .
2 218  20 "54" 204   .
3 192   . "51" 204 215
3 193   . "51" 204 215
3 194   . "51" 204 215
3 195   . "51" 204 215
3 196   . "51" 204 215
3 197   . "51" 204 215
3 198   . "51" 204 215
3 199   . "51" 204 215
3 200   . "51" 204 215
3 201   . "51" 204 215
3 202   . "51" 204 215
3 203   . "51" 204 215
3 204 220 "51" 204 215
3 205 237 "51" 204 215
3 206 215 "51" 204 215
3 207 361 "51" 204 215
3 208 225 "51" 204 215
3 209 219 "51" 204 215
3 210 338 "51" 204 215
3 211 398 "51" 204 215
3 212 123 "51" 204 215
3 213  37 "51" 204 215
3 214  37 "51" 204 215
3 215   0 "51" 204 215
3 216   . "51" 204 215
3 217   . "51" 204 215
3 218   . "51" 204 215
4 192   . "53" 204   .
4 193   . "53" 204   .
4 194   . "53" 204   .
4 195   . "53" 204   .
4 196   . "53" 204   .
4 197   . "53" 204   .
4 198   . "53" 204   .
4 199   . "53" 204   .
4 200   . "53" 204   .
4 201   . "53" 204   .
4 202   . "53" 204   .
4 203   . "53" 204   .
4 204   6 "53" 204   .
4 205  15 "53" 204   .
4 206  35 "53" 204   .
4 207  34 "53" 204   .
4 208  49 "53" 204   .
4 209  31 "53" 204   .
4 210   7 "53" 204   .
end
format %tq quarter
format %tq register_date
format %tq termination_date
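For comparison, an alternative setup I have been considering (untested sketch) lets -stset- measure analysis time from the registration date itself instead of shifting _t by hand:

Code:
stset quarter, id(firm) failure(failure == 1) origin(time register_date)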

mvprobit

Dear Stata users,
I’m writing with a question about the mvprobit package.
I’m estimating an mvprobit with 5 search actions (outcomes) of the unemployed, to see the effect of individual characteristics and structural variables on such actions. The output I have so far is in the attached pdf. First there is some descriptive material for the outcomes and covariates (the variable labels are in Italian, but that does not matter for my issue). Second, there is the mvprobit estimation with 5 equations (Y1, …, Y5; one equation for each unemployment job search action):

Y1 (0,1) =f(x)
Y2 (0,1) =f(x)
Y3 (0,1) =f(x)
Y4 (0,1) =f(x)
Y5 (0,1) =f(x)

Marginal effects on each outcome are obtained using posterior simulation (10,000 coefficient vectors simulated from the posterior distribution of the estimated model parameters) and are not shown in the pdf for the sake of brevity.

I’m interested in calculating the marginal effects for combinations of outcomes (joint marginal effects, to see the joint effect of search actions, since some unemployed people use combinations of actions), such as:

pr(Y1=1, Y2=1, Y3=1, Y4=1, Y5=1)
pr(Y1=0, Y2=1, Y3=1, Y4=1, Y5=1)
pr(Y1=0, Y2=0, Y3=1, Y4=1, Y5=1)
….
pr(Y1=1, Y2=0, Y3=1, Y4=1, Y5=1)
……
….
pr(Y1=0, Y2=0, Y3=0, Y4=0, Y5=1)

we won’t have the case of no search actions:
pr(Y1=0, Y2=0, Y3=0, Y4=0, Y5=0)

I read in the Stata manual, your help file, and the SJ article that it is easy to calculate these marginal effects for M = 3, but I was not able to find a way to calculate them for M > 3, which is my case with 5 equations.
Do you have some suggestions on this?
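The only route I have thought of so far is brute-force simulation of the joint probabilities from the estimated linear indices and error correlations, as in the untested sketch below (all numerical inputs are placeholders typed in by hand); the joint marginal effects would then come from differencing these simulated probabilities across covariate values:

Code:
clear
set seed 12345
set obs 100000
* placeholders: xb1-xb5 are the five linear indices evaluated at the covariate
* values of interest; R is the estimated 5x5 error correlation matrix
scalar xb1 =  .3
scalar xb2 =  .1
scalar xb3 = -.2
scalar xb4 =  .4
scalar xb5 =  0
matrix R = (1,.2,.2,.2,.2 \ .2,1,.2,.2,.2 \ .2,.2,1,.2,.2 \ ///
            .2,.2,.2,1,.2 \ .2,.2,.2,.2,1)
drawnorm e1 e2 e3 e4 e5, corr(R)
gen byte all5 = (xb1+e1>0) & (xb2+e2>0) & (xb3+e3>0) & (xb4+e4>0) & (xb5+e5>0)
summarize all5    // the mean approximates pr(Y1=1, Y2=1, Y3=1, Y4=1, Y5=1)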

Thank you very much in advance,
Chiara

Interpreting the results from Loneway command

Hi,

I was hoping someone could help me.
I ran the command -loneway recycling acode-, where recycling is the recycling rate within a local authority and acode is the code for each local authority.
Below are my results. What does 'intraclass correlation' mean here? Am I correct in thinking that around two thirds of the variation in recycling rates comes from differences between local authorities, i.e. that recycling rates are highly correlated within local authorities because of authority-specific variation?

Thank You


Code:
        One-way Analysis of Variance for recycling:

                                          Number of obs =     6,350
                                              R-squared =    0.6220

    Source              SS         df        MS           F     Prob > F
-------------------------------------------------------------------------
Between acode        103750.6       317   327.28895      31.31    0.0000
Within acode        63046.518     6,032   10.452009
-------------------------------------------------------------------------
Total               166797.12     6,349     26.2714

         Intraclass       Asy.
         correlation      S.E.       [95% Conf. Interval]
         ------------------------------------------------
            0.60287      0.02016      0.56336     0.64237

         Estimated SD of acode effect            3.983316
         Estimated SD within acode               3.232957
         Est. reliability of a acode mean         0.96806
              (evaluated at n=19.97)

Help looping with levelsof and egen...

I need to create a binary indicator (weightmedian_binary) based on whether a variable (weight) is below or above its median value (weightmedian_priceq5) within each quintile of another variable (price_q5). Given how confusing this sounds, I have tried to illustrate with an example using the auto dataset. The issue is that I need to generate multiple variables like this (beyond just weight), all based on whether they are below or above their median value within each quintile of price. I am trying to write a do-file, but I am having difficulties: I do not know how to reference the median values from levelsof, how to have the loop iterate correctly over the quintile values, and so on. Any help would be greatly appreciated. Thanks in advance!

Code:
sysuse auto
egen price_q5 = xtile(price), n(5)
egen weightmedian_priceq5 = median( weight ),  by( price_q5)
levelsof weightmedian_priceq5
gen weightmedian_binary = .
replace weightmedian_binary = 0 if price_q5==1 & weight<2640
replace weightmedian_binary = 1 if price_q5==1 & weight>=2640
replace weightmedian_binary = 0 if price_q5==2 & weight<2650
replace weightmedian_binary = 1 if price_q5==2 & weight>=2650
replace weightmedian_binary = 0 if price_q5==3 & weight<2670
replace weightmedian_binary = 1 if price_q5==3 & weight>=2670
replace weightmedian_binary = 0 if price_q5==4 & weight<3280
replace weightmedian_binary = 1 if price_q5==4 & weight>=3280
replace weightmedian_binary = 0 if price_q5==5 & weight<3890
replace weightmedian_binary = 1 if price_q5==5 & weight>=3890
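The kind of loop I have been trying to write looks like the untested sketch below (length and displacement are just stand-ins for my other variables):

Code:
sysuse auto, clear
egen price_q5 = xtile(price), n(5)      // as in my example above
foreach v of varlist weight length displacement {
    tempvar med
    egen `med' = median(`v'), by(price_q5)
    gen byte `v'median_binary = `v' >= `med' if !missing(`v', `med')
}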

Failed Ramsey RESET test

Hi,

I'm fairly new to Stata. I've just manually run a RESET test on my panel data, and it failed. I understand that this points to an incorrect functional form for my linear model, but I'm not sure which variables in the model need to be changed. Is there a way I can tell?
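The manual test I ran was roughly along the lines of the untested sketch below (y, x1, and x2 are placeholders for my actual variables):

Code:
xtreg y x1 x2, fe
predict yhat, xb
gen yhat2 = yhat^2
gen yhat3 = yhat^3
xtreg y x1 x2 yhat2 yhat3, fe
testparm yhat2 yhat3    // RESET-style test: significant powers of the fitted values suggest misspecification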

Thanks

Adjusted/Within/Between R squared

Hi,

I would be very grateful if someone could confirm whether my understanding is correct.
I ran a panel regression and then asked Stata for the adjusted R-squared. Can I assume this is the within R-squared? My panels are local authorities.

"The adjusted r squared is 0.361 for dry recycling and 0.699 for compost, telling us only 36.1% of the variation in dry recycling rates is explained by the independent variables, and 69.9% of the variation in compost recycling rates."

Is it the variation in dry recycling rates, or the variation within local authorities in dry recycling rates?

Thank you in advance!
Darcy

Multinomial logit or -gsem-: which is best for simultaneous choices?

Hello,

I am looking for suggestions on choosing an appropriate model for my research. I am trying to estimate the probability that unemployed individuals use different job search methods. I have five binary outcome variables (search channels) that individuals used to find jobs: (1) contacted potential employer(s) directly (yes/no), (2) went through friend(s)/relative(s) (yes/no), (3) placed or answered newspaper ad(s) (yes/no), (4) consulted an employment agency (yes/no), (5) searched the Internet (yes/no). On the right-hand side, I am using demographic variables (age, sex, education, ethnicity) and other neighborhood characteristics as predictors. My data are longitudinal, and I will estimate the model in both cross-sectional and panel setups.

I have a few options for estimating the model. First, I could collapse the outcomes into two variables, informal networks (friends and relatives) and formal networks (Internet or other institutional methods), and use a simple logit/probit model. But I am also interested in examining the probability of using different formal methods, e.g. the Internet versus an employment agency. In that case I could use a multinomial logit/probit model. The problem with multinomial logit/probit, however, is that it assumes each individual selects only one alternative, whereas in my data individuals use three or four methods simultaneously. If I restrict the sample to individuals who used only one method, I lose more than half of my sample.

Second, I could use -gsem- to estimate the model. The advantage of -gsem- is that it would allow five separate but correlated binary outcome equations and give me a separate set of coefficients for each outcome. But I am not sure whether -gsem- is the only available option, or the best option, for my problem. Additionally, I was wondering whether I could use a multivariate logit model as an alternative.
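For concreteness, the -gsem- specification I have in mind would be something like the untested sketch below; the outcome and covariate names are placeholders, and the shared latent factor L only approximates a full multivariate probit by imposing a one-factor correlation structure across the five probit equations:

Code:
gsem (employer friends newspaper agency internet <- c.age i.sex i.educ i.ethnicity L, probit)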

Any suggestions and advice will be greatly appreciated.

How to use hp filter on different groups

I have GDP data for each state for the last 10 years, and I want to apply the HP filter to each state's time series. Suppose "state" is the variable for the state name and "year" is the time variable; how can I use the "tsfilter hp" command on this? Thanks in advance!
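What I have sketched out so far (untested; gdp is a placeholder for my GDP variable, and 6.25 is just the smoothing value often suggested for annual data) is to declare the data as a panel so that each state is filtered separately:

Code:
encode state, gen(state_id)     // tsfilter needs a numeric panel identifier
xtset state_id year
tsfilter hp gdp_cycle = gdp, smooth(6.25) trend(gdp_trend)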

hierarchical random effect meta-analysis

Hi everyone,

I am doing a meta-analysis and I want to use a hierarchical random-effects meta-analysis. Is there any tutorial on how to do this in Stata, or at least do you know the syntax for it?

Creating daily average and hourly levels from hourly data

Hi everyone!

I'm new to Stata and would appreciate your advice!

I have gathered hourly data on pollution emissions from 7 different monitoring stations in the city of Paris, France. I have 24 observations per day for each pollutant from each monitoring station, one per hour of the day, from 01/10/2009 to 01/10/2019. Time is my independent variable. My data set looks like the extract below:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double DateTime int(no2_pa13 pm25_aut)
1.5471648e+12  81 105
1.5471684e+12  86  99
 1.547172e+12  87  91
1.5471756e+12  89  86
1.5471792e+12  87  86
1.5471828e+12  83  87
1.5471864e+12  87  89
  1.54719e+12  94  95
1.5471936e+12 102  98
1.5471972e+12 114  97
1.5472008e+12 127 102
1.5472044e+12 145 114
 1.547208e+12 166 126
1.5472116e+12 161 136
1.5472152e+12 167 141
1.5472188e+12 147 137
1.5472224e+12 119 131
 1.547226e+12  99 129
1.5472296e+12 116 145
1.5472332e+12 128 162
1.5472368e+12 124 159
1.5472404e+12 126 143
 1.547244e+12 133 143
1.5472476e+12 152 161
1.5472512e+12 131 169
1.5472548e+12 114 153
1.5472584e+12 110 140
 1.547262e+12 105 138
1.5472656e+12  93 146
1.5472692e+12  88 150
1.5472728e+12  83 149
1.5472764e+12  75 137
  1.54728e+12  67 127
1.5472836e+12  65 126
1.5472872e+12  78 128
1.5472908e+12  97 134
1.5472944e+12 104 129
 1.547298e+12  99 127
1.5473016e+12  97 126
1.5473052e+12 104 123
1.5473088e+12 110 120
1.5473124e+12 116 113
 1.547316e+12 116 112
1.5473196e+12 115 110
1.5473232e+12 117  97
1.5473268e+12 115  89
1.5473304e+12 104  78
 1.547334e+12  93  79
1.5473376e+12  83  92
1.5473412e+12  75  91
1.5473448e+12  73  88
1.5473484e+12  69  73
 1.547352e+12  68  67
1.5473556e+12  75  66
1.5473592e+12  83  67
1.5473628e+12  82  61
1.5473664e+12  94  59
  1.54737e+12   .  58
1.5473736e+12   .  59
1.5473772e+12  72  57
1.5473808e+12  66  49
1.5473844e+12  65  45
 1.547388e+12  73  43
1.5473916e+12  81  39
1.5473952e+12  76  36
1.5473988e+12  78  34
1.5474024e+12  77  31
 1.547406e+12  77  28
1.5474096e+12  72  27
1.5474132e+12  56  24
1.5474168e+12  39  22
1.5474204e+12  32  19
 1.547424e+12  32  16
1.5474276e+12  35  12
1.5474312e+12  40  13
1.5474348e+12  40  15
1.5474384e+12  45  17
 1.547442e+12  69  17
1.5474456e+12  83  25
1.5474492e+12  95  28
1.5474528e+12  81  26
1.5474564e+12  77  26
  1.54746e+12  67  29
1.5474636e+12  65  31
1.5474672e+12  63  36
1.5474708e+12  67  38
1.5474744e+12  71  40
 1.547478e+12  75  42
1.5474816e+12  81  33
1.5474852e+12  83  30
1.5474888e+12  77  32
1.5474924e+12  58  31
 1.547496e+12  56  34
1.5474996e+12  53  36
1.5475032e+12  49  36
1.5475068e+12  54  34
1.5475104e+12  40  28
 1.547514e+12  23  22
1.5475176e+12  27  19
1.5475212e+12  30  16
end
format %tcMonth_dd,_CCYY_HH:MM:SS DateTime
The format of the date is %tcMonth_dd,_CCYY_HH:MM:SS.
In the variable names, no2 is the pollutant and pa13 is the monitoring station.

From that, I would like to create two new sets of variables: (1) average daily pollution levels for each pollutant, across all 7 stations, between 01/10/2009 and 01/10/2019; and (2) average pollution levels by hour of the day for each pollutant, across all 7 stations. How can I create these from my data set?
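What I have sketched so far (untested; it uses only the two series in the extract above, but the same pattern would extend to all pollutants and stations) is:

Code:
gen double day  = dofc(DateTime)     // calendar day
format %td day
gen byte   hour = hh(DateTime)       // hour of the day, 0-23

preserve
collapse (mean) no2_pa13 pm25_aut, by(day)     // average daily levels
list in 1/5
restore

preserve
collapse (mean) no2_pa13 pm25_aut, by(hour)    // average level by hour of the day
list
restore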

Thanks for your help!

Guillaume

What computer specs do I need to analyze large data with fixed effects?

Hi
I have very large daily data.
The dataset has 1 million rows and 50 columns.
However, I anticipate that the number of columns will increase to approximately 5,000 once I add daily dummy variables, weekly dummy variables, and week x city dummy variables.
I would like to use these data for a fixed-effects analysis.
My computer has a 4-core CPU and 8 GB of RAM.

In two runs with Stata 13 SE, the computer hung for 8 hours and 10 hours, respectively.

I do not know what to do, since this is the first time I have had to work with data this large.
Is there a good way to handle it?
I am willing to add RAM if necessary.
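One idea I have been considering (untested sketch; reghdfe is a user-written command from SSC, and y, x1, x2, day, week, and city are placeholders for my variables) is to absorb the high-dimensional dummies rather than creating roughly 5,000 explicit dummy columns:

Code:
ssc install reghdfe
egen long weekcity = group(week city)                     // week x city fixed effects
reghdfe y x1 x2, absorb(day weekcity) vce(cluster city)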

Thanks for reading
Thanks for any advice.