Channel: Statalist

Propensity score matching with weight

Hello.
I am trying to analyze my data using propensity score matching, so I used the -teffects psmatch- command, but I realized I cannot use analytic weights (aweight) with logit or probit.
Searching around, I found a recommendation that a logit with aweights can be fit via -glm, link(logit)-. However, -teffects psmatch- allows fweights only.
So I am confused about the following:

1. Should I analyze my data without aweights? (I think this may be acceptable, because the two groups will be almost the same after PSM even without weights, but the matched sample will not represent the population.)
2. How can I do propensity score matching with aweights?

Before writing here I searched for a long time, but I couldn't find answers. Can someone help me solve this problem?
Thank you for reading. Have a nice day.

Fleiss kappa or ICC for interrater agreement (multiple readers, dichotomous outcome) and correct Stata command

106 units are all assessed by the same 5 readers. Units were judged either positive or negative (dichotomous outcome).

What is the most appropriate statistical test of interrater agreement here?

1) Is the ICC (two-way random-effects model, single rater, agreement) useful here, or does it apply only to continuous data or to categorical data with more than 2 possible ratings?
2) Is Fleiss kappa the test of choice, and if so,
a) is the correct Stata command -kappa pos neg- (when the data are organised as: column 1: subject id; column 2: number of positive reads (pos); column 3: number of negative reads (neg))?
b) does this test allow one reader to serve as the basis/gold standard (the one the others should agree with)?
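For reference, Fleiss' kappa for exactly this data layout (one row per unit, columns holding the number of readers choosing each category) is straightforward to compute by hand. A minimal sketch in Python, assuming every unit is rated by the same 5 readers (the data below are hypothetical):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where each row is one subject and each
    column holds the number of raters assigning that category.
    Every row must sum to the same number of raters n."""
    N = len(counts)                      # number of subjects
    n = sum(counts[0])                   # raters per subject
    k = len(counts[0])                   # number of categories

    # Per-subject agreement P_i and overall category proportions p_j
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)        # chance agreement

    return (P_bar - P_e) / (1 - P_e)

# Layout matching the post: one row per unit, columns = (pos, neg),
# 5 readers per unit. Unanimous ratings give kappa = 1.
print(fleiss_kappa([[5, 0], [0, 5], [5, 0]]))  # 1.0
print(fleiss_kappa([[3, 2], [4, 1], [2, 3]]))
```

Note that Fleiss' kappa treats all raters symmetrically, so it has no built-in notion of one reader being a gold standard; agreement against a reference reader would be a different (pairwise) analysis.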

I will be very grateful for your input.

Cloglog: backed up

Hi everyone,

I am estimating a discrete hazard rate model, in particular a complementary log log (cloglog) for unemployment to employment transitions.

Whenever I add 3 or more covariates, the estimation gets stuck in `backed up' iterations where the log likelihood doesn't change.

I guess this has to do with the likelihood function being relatively flat where the algorithm is searching, but in any case, is this a normal feature of estimating such hazard rates (I'm new to this domain)?

Tom

Predicting y after having orthogonalized x's

Dear all,

I am using Stata 15. I orthogonalized two variables that were highly correlated using the following command:
Code:
orthog x1 x2, generate(orthx1 orthx2)
Then I ran a multiple regression, adding other variables to the ones just orthogonalized.
Code:
regress y orthx1 orthx2 x3 x4
Now I am trying to predict y using some predefined values for the x's. I have values for x1 and x2 that I would like to use to predict y, but they are not on the orthogonalized scale. How can I convert the values of x1 and x2 to their orthogonalized versions so that I can use the coefficients from the regression output to predict y? I tried to google this but was not able to find any results.
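The underlying principle can be illustrated outside Stata: orthogonalization is a linear transformation of the original variables, so new raw values must be pushed through the same transformation matrix before the coefficients from the orthogonalized regression apply. A sketch using a plain QR decomposition in Python (hypothetical data; Stata's -orthog- uses its own normalization, so this illustrates the idea, not the exact Stata matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical highly correlated regressors, standing in for x1 and x2
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)
X = np.column_stack([x1, x2])

# Orthogonalize via QR: X = Q @ R, so the orthogonal columns are Q = X @ inv(R)
Q, R = np.linalg.qr(X)
Rinv = np.linalg.inv(R)

# New raw values of (x1, x2) must be pushed through the SAME Rinv
# before the coefficients from the orthogonalized regression apply
x_new = np.array([[0.5, 0.6]])
q_new = x_new @ Rinv

# Fit y on the orthogonalized and on the raw regressors
y = 1.5 * x1 - 0.7 * x2 + rng.normal(size=100)
b_q = np.linalg.lstsq(Q, y, rcond=None)[0]
b_x = np.linalg.lstsq(X, y, rcond=None)[0]

# Predictions agree whichever basis is used, provided new values are
# transformed consistently
print(np.allclose(q_new @ b_q, x_new @ b_x))  # True
```

So the practical requirement is to save the transformation matrix used on the estimation sample and apply it to the new values of x1 and x2 as well.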

Thank you very much for your help and time.

xtdpdgmm and AR(2)

Dear Statalists,

I am estimating the following:

Code:
xtdpdgmm eci l2.eci fdi_log ipp urb_rate biz rd pat charg, gmmiv(l2.eci ipp urb_rate biz rd pat charg, lagrange(1 3) collapse model(difference)) iv(l.fdi_log, difference model(difference)) twostep vce(robust) noserial
Testing for autocorrelation:

Code:
Arellano-Bond test for autocorrelation of the first-differenced residuals
H0: no autocorrelation of order 1:   z = -2.1280   Prob > |z| = 0.0333
H0: no autocorrelation of order 2:   z = -2.1967   Prob > |z| = 0.0280
H0: no autocorrelation of order 3:   z =  0.3968   Prob > |z| = 0.6915
H0: no autocorrelation of order 4:   z =  0.0921   Prob > |z| = 0.9266
Since I have to reject the null of no second-order autocorrelation, does this mean I have to use deeper lags (from lag 3 onwards) of the endogenous variables and/or of the dependent variable as well?

From what I am reading in the literature, I cannot tell whether this problem applies only to difference GMM or also to system GMM and the Ahn-Schmidt estimator.

Can somebody refer me to literature with examples on the Ahn-Schmidt estimator?

Thanks in advance for any answers!



Changes over time within a day and another if condition

Hello,

I am new to Stata. I have a dataset with the variables price, Date (yyyy-mm-dd hh-mm-ss) and id.

Each id changes its price a few times during a day. I need a new column with the price changes (old price minus new price), computed only within a single day and only within the same id.
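The logic being described (difference against the previous observation only when it belongs to the same id and the same calendar day) can be sketched outside Stata as follows, using hypothetical data:

```python
from datetime import datetime

# Toy records: (id, timestamp, price). Hypothetical data for illustration.
rows = [
    (1, datetime(2018, 7, 2, 9, 0), 10.0),
    (1, datetime(2018, 7, 2, 14, 0), 10.5),
    (1, datetime(2018, 7, 3, 9, 0), 11.0),   # new day: no change computed
    (2, datetime(2018, 7, 2, 9, 30), 20.0),
    (2, datetime(2018, 7, 2, 15, 0), 19.5),
]

# Sort by id and time, then difference against the previous row only
# when it has the same id AND falls on the same calendar day.
rows.sort(key=lambda r: (r[0], r[1]))
changes = []
prev = None
for rid, ts, price in rows:
    if prev and prev[0] == rid and prev[1].date() == ts.date():
        changes.append(prev[2] - price)   # old price minus new price
    else:
        changes.append(None)              # first observation of the id/day
    prev = (rid, ts, price)

print(changes)  # [None, -0.5, None, None, 0.5]
```

In Stata, the analogous approach would be to extract the day from Date, sort by id, day, and time, and take differences within id-day groups.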

I would be very glad for your help.

How to generate an R-squared after xtlogit in Stata

Dear respected members,
Please, is there any command for generating an R-squared after xtlogit in Stata?
Thank you in advance for your contribution
Adamu

svy and clustering

Hello,

I'm working on a data set from a nationally-representative survey and am wondering if someone could shed some light on an issue I've encountered.

The subsample I'm interested in consists of twins, so I need to account for within-cluster correlations. This presents a problem with svy commands. I'm looking to use svy: logit after setting the PSU, weight, and stratum variable appropriate for the survey design for generalizability. Doing this, however, prevents me from also specifying cluster(twinpair). Is there a way to specify both the survey design and twin-level clustering?

Thank you in advance.

Help in Stata homework

Hey guys. I need your help.

We have a homework assignment for Stata software.

But I have no idea how to do it.

Could you please help me to write the script?

I have attached a screenshot with my homework.

Thanks.

Max

CAR for multiple event days

Good evening,

I am trying to calculate the CAR for multiple event days (9 event days in total). I have found all the steps to follow when there is a single event day, but I get confused when trying to handle all of them. My question is: how can I include all the event days? Thank you very much in advance.

Datetime format problem

Hello,

I have a variable date in the following format:

date
"2017-08-31 11:48:05"
"2017-08-31 11:48:05"
"2017-08-31 11:48:05"
"2017-08-31 11:48:05"
"2017-08-31 11:48:05"
"2017-08-31 11:48:05"

I want Stata to recognize it as a date, so I have tried this:

gen dateobs = clock(date, "MD20Yhm")
format dateobs %tc

Unfortunately, after the first command the message "(xxxxx missing values generated)" appears.

Can anybody tell me what I can try?
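One thing worth checking is that the mask must match the order of the components in the string, which here is year-month-day hour:minute:second. The same parsing logic, sketched in Python:

```python
from datetime import datetime

raw = "2017-08-31 11:48:05"

# The components appear year-first, so the mask must list year, month,
# day, hour, minute, second, in that order
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
print(parsed)  # 2017-08-31 11:48:05

# A month-first mask (the analogue of "MD20Yhm") cannot match this
# string, which is why every row fails to parse
try:
    datetime.strptime(raw, "%m-%d-%Y %H:%M")
except ValueError:
    print("month-first mask: parse failed")
```

In Stata terms, the mask would likewise need to be year-first and include seconds (something along the lines of "YMDhms" rather than "MD20Yhm"); -help datetime- documents the mask characters.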

Understanding the speedup from using parallel with synthetic cohort methods.

I am trying to optimize the speed of the user-written -synth_runner- command using 8-core Stata/MP 15.1 on a Mac with 16 GB of physical memory.

I ran a simulation where I varied the number of clusters from 1 to 8 and also performed a non-parallelized version of the analysis (all code at bottom).

Here are the results, where the timer # corresponds to the number of clusters. Timer 10 is the non-clustered version.
Code:
. timer list
   1:     43.59 /        1 =      43.5920
   2:     25.23 /        1 =      25.2340
   3:     20.99 /        1 =      20.9890
   4:     20.11 /        1 =      20.1050
   5:     19.37 /        1 =      19.3670
   6:     19.36 /        1 =      19.3550
   7:     20.06 /        1 =      20.0600
   8:     19.27 /        1 =      19.2720
  10:     77.37 /        1 =      77.3670
I am struggling to understand why the runtime does not decrease much beyond 3 clusters. I also get similar results when I use the nested optimization option (though obviously all the times are longer).

Here's the code:

Code:
set more off
set trace off
clear all
cls

cap drop pre_rmspe post_rmspe lead effect cigsale_synth
cap drop cigsale_scaled effect_scaled cigsale_scaled_synth D
cap program drop my_pred my_drop_units my_xperiod my_mspeperiod

program my_pred, rclass
    args tyear
    return local predictors "beer(`=`tyear'-4'(1)`=`tyear'-1') lnincome(`=`tyear'-4'(1)`=`tyear'-1')"
end

program my_drop_units
    args tunit
    if `tunit'==39 qui drop if inlist(state,21,38)
    if `tunit'==3 qui drop if state==21
end

program my_xperiod, rclass
    args tyear
    return local xperiod "`=`tyear'-12'(1)`=`tyear'-1'"
end

program my_mspeperiod, rclass
    args tyear
    return local mspeperiod "`=`tyear'-12'(1)`=`tyear'-1'"
end


timer clear
timer on 10

use smoking, clear
tsset state year

gen byte D = (state==3 & year>=1989) | (state==7 & year>=1988)

synth_runner cigsale retprice age15to24, d(D) pred_prog(my_pred) trends training_propr(`=13/18') ///
drop_units_prog(my_drop_units) xperiod_prog(my_xperiod) mspeperiod_prog(my_mspeperiod) deterministicoutput ///nested

effect_graphs
pval_graphs

timer off 10

forvalues p = 1(1)8 {

    timer on `p'

    parallel clean, all
    parallel setclusters `p'

    use smoking, clear
    tsset state year

    gen byte D = (state==3 & year>=1989) | (state==7 & year>=1988)

    synth_runner cigsale retprice age15to24, d(D) pred_prog(my_pred) trends training_propr(`=13/18') ///
    drop_units_prog(my_drop_units) xperiod_prog(my_xperiod) mspeperiod_prog(my_mspeperiod) parallel deterministicoutput ///nested

    effect_graphs
    pval_graphs

    timer off `p'
}


timer list

Using slope dummy to test for significant effect difference between subperiods

Hi, I'm doing panel regressions with growth rates but I want to try to visualise my problem with the invest2 dataset included in Stata.

I'm doing panel regressions (time/years: 1-20) of the form "xtgls invest market i.time, panels(hetero) corr(ar1)". I also want to look at subperiods, e.g. 1-10 and 11-20, and formally test whether the effect (beta_market) differs significantly between the two subperiods. If I understand correctly (from past Statalist posts), I can do this with another regression over the full period (1-20) that adds a subperiod dummy (0 for 1-10, 1 for 11-20) and its interaction: "xtgls invest c.market##i.dummy i.time, panels(hetero) corr(ar1)". Then (if I understand correctly), if the coefficient on the interaction "c.market#i.dummy" (beta_dummy#c.market1 in the output) is significant, I can conclude that the slope for subperiod 1-10 differs from the slope for subperiod 11-20.

The Stata commands for the example:

use http://www.stata-press.com/data/r12/invest2.dta
gen dummy = 0 if inrange(time,1,20)
replace dummy = 1 if inrange(time,11,20)
xtset company time
xtgls invest c.market##i.dummy i.time, panels(hetero) corr(ar1)
xtgls invest market i.time if inrange(time, 1, 10), panels (hetero) corr(ar1)
xtgls invest market i.time if inrange(time, 11, 20), panels (hetero) corr(ar1)

In this case I would conclude that:

for subperiod 1-10 b_market = 0.0825664 is significant at 1% and
for subperiod 11-20 b_market=0.1122994 is significant at 1% and/but

because beta_dummy#c.market1 is not significant (p-value = 0.323), the difference in the slopes is not significant.
(Is this interpretation correct?)

Another problem: I also discovered that "beta_market + beta_dummy#c.market1" from "xtgls invest c.market##i.dummy i.time, panels(hetero) corr(ar1)" is not equal to beta_market from the subperiod 11-20 regression "xtgls invest market i.time if inrange(time, 11, 20), panels(hetero) corr(ar1)", although the two are equal when I run all regressions without the panels(hetero) corr(ar1) options. In the first case, with both options used, which betas for the two subperiods should I report in my study: the two separate betas from the subperiod regressions "xtgls invest market i.time if inrange(time, 1, 10), panels(hetero) corr(ar1)" and "xtgls invest market i.time if inrange(time, 11, 20), panels(hetero) corr(ar1)", or (if I also want to conclude whether the difference between the subperiod betas is significant) the two slightly different betas (beta_market and beta_market + beta_dummy#c.market1) from the full-period interaction regression "xtgls invest c.market##i.dummy i.time, panels(hetero) corr(ar1)"? Thank you for your help!

The Stata commands I used for the second problem:

use http://www.stata-press.com/data/r12/invest2.dta
gen dummy = 0 if inrange(time,1,20)
replace dummy = 1 if inrange(time,11,20)
xtset company time
xtgls invest c.market##i.dummy i.time, panels(hetero) corr(ar1)
xtgls invest market i.time if inrange(time, 1, 10), panels (hetero) corr(ar1)
xtgls invest market i.time if inrange(time, 11, 20), panels (hetero) corr(ar1)
di 0.0985778+0.0202642
// is not equal to beta_market from "xtgls invest market i.time if inrange(time, 11, 20), panels (hetero) corr(ar1)"

//same regressions without controlling for panels(hetero) corr(ar1):
xtgls invest c.market##i.dummy i.time
xtgls invest market i.time if inrange(time, 1, 10)
xtgls invest market i.time if inrange(time, 11, 20)
di 0.1060946+0.0858159
// is equal to beta_market from "xtgls invest market i.time if inrange(time, 11, 20)"
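As a cross-check on the interpretation question: under plain OLS, a model that interacts the dummy with both the intercept and the slope reproduces the two subperiod regressions exactly, so the interaction coefficient is exactly the difference between the subperiod slopes. That identity is what breaks down once panels(hetero) corr(ar1) is added, because FGLS weights the pooled and subsample fits differently. A sketch with simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x and y over two subperiods (d = 0/1),
# with a different slope in each subperiod
n = 200
d = np.repeat([0, 1], n)
x = rng.normal(size=2 * n)
y = 1.0 + 0.5 * x + 0.3 * d + 0.4 * d * x + rng.normal(size=2 * n)

def ols_slope(xv, yv):
    X = np.column_stack([np.ones_like(xv), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0][1]

# Full-sample regression with subperiod dummy and interaction
X = np.column_stack([np.ones_like(x), x, d, d * x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
b_x, b_inter = b[1], b[3]

# Separate regressions per subperiod
b0 = ols_slope(x[d == 0], y[d == 0])
b1 = ols_slope(x[d == 1], y[d == 1])

# Under plain OLS the identities hold exactly:
# full-sample slope = subperiod-0 slope, and
# full-sample slope + interaction = subperiod-1 slope
print(np.isclose(b_x, b0), np.isclose(b_x + b_inter, b1))
```

Under FGLS the pooled model and the subsample models no longer use the same weights, so the identity holds only approximately, which matches what you observed.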

fastreshape - more efficient implementation of reshape for big datasets

Stata's reshape program is an essential tool for data prep work. However, it is well known that the performance of reshape isn't great for large datasets; see these benchmarking results and this Statalist topic for additional context. Because the poor performance of reshape on big datasets often imposes a significant barrier to my research team's workflow (our reshapes can take hours!), I went ahead and coded up the suggestions in the previously mentioned Statalist topic into an .ado-file that should work for any kind of reshape. I imagine that this program will be useful to anyone who uses Stata to process large datasets.

In short, fastreshape is significantly faster than reshape in most use cases, particularly for wide-to-long reshapes. I ran a number of benchmarks with Stata-MP on Stanford's cluster computing service, the results of which show that wide-to-long reshapes run between 2 and 15 times faster when using fastreshape. Similarly, long-to-wide reshapes run a modest (but still substantial) 1.5 to 5 times faster when using fastreshape.

The syntax and output of fastreshape mirror those of reshape, with a few notable exceptions. For one, -fastreshape error- does not identify problem observations in cases where the program fails (as reshape does). Second, the atwl(chars) option is not yet supported. Lastly, fastreshape does not yet return all of the information in macros that reshape does. In my experience, these features are not particularly important, but I would like to implement them in the near future, and I don't expect they will slow the program down at all. In addition, I have added a new optional argument ('fast') that allows the user to skip sorting the dataset post-reshape for an additional modest performance boost. The default behavior is to sort by i and j for wide-to-long reshapes and by i for long-to-wide reshapes, as -reshape- does.

Although I think the program will replace the vast majority of reshape instances out of the box with no modification of syntax, I should caution that this program has not been tested by anyone other than myself, so there may be bugs. If you have any suggestions for additional functionality or would like to report a bug, please let me know in this topic, or alternatively create an issue / pull request on Github. I will continue to test the program over the next week or two before submitting to SSC. Thanks!

Read more here: https://github.com/mdroste/stata-fastreshape

Shout-outs to Robert Picard, Paul von Hippel, and Daniel Feenberg for the Statalist commentary that inspired this program.

Bar chart for complete sample AND one subgroup

Dear All,

I don't know if I will be able to describe my problem correctly, but I will give it a try.

I am trying to produce a bar chart with the following characteristics: first, bars for the percentages of each category calculated for the complete sample, and right next to those, bars for the percentages of the category for just one subgroup of the sample. The resulting graph would have 6 bars, in this order:

Bar 1 - percentage of category 1 for whole sample,
Bar 2 - percentage of category 1 for subgroup 1,

Bar 3 - percentage of category 2 for whole sample,
Bar 4 - percentage of category 2 for subgroup 1,

Bar 5 - percentage of category 3 for whole sample,
Bar 6 - percentage of category 3 for subgroup 1,

I would also need to place the percentage labels on top of each bar. Can anyone help me? Sorry if any part of my question is incorrectly specified.

Best regards,

Paolo Moncagatta

incorporating I2 from metan into hetred command

$
0
0
Hi Statalist

I want to pass the I-squared global output from metan, saved as $S_51, into the i2h() option of the hetred command:

hetred logor se, i2h() i2l() id(author)

I can't get the syntax right. I've tried i2h($S_51) but get an "invalid option" error.
Any suggestions? Thanks in anticipation

Computing loan balances

Hi all:

I have a dataset with thousands of loans, and I would like code that cycles through each loan to calculate the loan balance as of a desired month. Attached is a sample of 3 observations covering 2 loan types for illustration. The last column "bal" is the ending loan balance; this is what I need.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id long amt int(basedays amort) double(rate pmt) int(start term) byte io int(whichmth amortstart) long bal
1 2000000 360 360    .05 10736.43 21186  12 6   6 21367 1986552
2 3000000 360 360   .045 15200.56 21186   6 0   6     . 2976448
3 5400000 360 240 .04787 35009.25 21154 120 0 120     . 3373218
end
format %td start
format %td amortstart
The "start" date is when amortization begins for a loan, unless the loan has an interest-only period ("io" > 0), in which case amortization begins after the io period expires, i.e. on the "amortstart" date.

If amort>0 and io>0, then "term" minus "io" equals "whichmth", which is converted to a date in "amortstart".
If amort>0 and io=0, then "term" equals "whichmth".

FYI, basedays is the number of days in the year, rate is the annual interest rate, and the remaining variables are measured in months.

I have included a partial amortization schedule for the first loan. There is no amortization during the first 6 months of the loan, since that is the interest-only period; amortization begins in the 7th month, on the "amortstart" date. At the end of the first 6 amortizing months, the balance is 1986552.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long date byte month double payment int(interest principal) long endbal
21367 0        .    .    . 2000000
21398 1 10736.43 8611 2125 1997875
21429 2 10736.43 8602 2134 1995740
21459 3 10736.43 8316 2421 1993319
21490 4 10736.43 8582 2154 1991165
21520 5 10736.43 8297 2440 1988725
21551 6 10736.43 8563 2174 1986552
end
format %td date
Sample computations for month 1, and so on until the end of amortization (months 1-360):
Interest = (8/2/2018 - 7/2/2018) * (.05/360) * 2000000 = 31 * (.05/360) * 2000000 = 8611.11
Principal = 10736.43 - 8611.11 = 2125.32
Bal = 2000000 - 2125.32 = 1997874.68, which rounds to 1997875
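The sample computations above can be replicated end to end. A minimal Python sketch of the actual/360 accrual loop (assuming, as in the sample schedule, payment dates on the 2nd of each month):

```python
from datetime import date

# Recompute the first six amortizing months of loan 1
amt, rate, pmt = 2_000_000, 0.05, 10736.43
basedays = 360

# Hypothetical payment dates matching the posted schedule
dates = [date(2018, m, 2) for m in (7, 8, 9, 10, 11, 12)] + [date(2019, 1, 2)]

bal = amt
for prev, cur in zip(dates, dates[1:]):
    days = (cur - prev).days
    interest = days * rate / basedays * bal   # actual/360 accrual
    principal = pmt - interest
    bal -= principal

print(round(bal))  # 1986552, matching the posted schedule
```

The loop reproduces the posted interest figures (8611, 8602, 8316, 8582, 8297, 8563) after rounding, so the same per-month logic applied loan by loan should yield the "bal" column.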

I would appreciate any help in writing code to extract the loan balances. Thanks.

Best
Amrik

Panel Data using lagged variable

I have a question regarding the correct code for creating a lagged variable: the two versions below seem to generate different results.
I have panel data; company id is my cross-sectional identifier and year is my time variable.
I am trying to compute book leverage using total assets from the previous year. This is just for illustration purposes; my variable names are different.
Here are the two versions of the code. The first version is:
Code:
xtset companyid fyear, year
gen lTA = l.TA
gen Book_Leverage=total_debt/lTA
The second version is:
Code:
xtset companyid fyear, year
gen Book_Leverage=(total_debt)/(TA[_n-1])
Which code is correct? To me it seems like the second version of the code gives slightly different results.
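For what it's worth, the two constructions are not equivalent: TA[_n-1] takes the value from the previous row of the dataset, regardless of company boundaries or gaps in fyear, while l.TA honors the xtset panel structure and returns missing where the previous year is absent. A minimal illustration with hypothetical data, sketched outside Stata:

```python
# Toy panel: (companyid, fyear, TA). Hypothetical numbers for illustration.
data = [
    (1, 2000, 100), (1, 2001, 110), (1, 2003, 130),  # note the gap: no 2002
    (2, 2000, 200), (2, 2001, 210),
]

# Naive "previous row" lag, the analogue of TA[_n-1]
naive = [None] + [row[2] for row in data[:-1]]

# Panel-aware lag, the analogue of l.TA after xtset companyid fyear:
# previous value only if same company AND exactly one year earlier
panel = []
for i, (cid, yr, ta) in enumerate(data):
    prev = data[i - 1] if i > 0 else None
    if prev and prev[0] == cid and prev[1] == yr - 1:
        panel.append(prev[2])
    else:
        panel.append(None)

print(naive)  # [None, 100, 110, 130, 200]
print(panel)  # [None, 100, None, None, 200]
```

The naive lag carries company 1's last value into company 2's first row and ignores the missing year, which is exactly where the two Stata versions would diverge; the l.TA version is the one that respects the panel structure.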
Thanks much!

Country/Time-Fixed Effects

Hi all,

I am an absolute beginner in stata and urgently need your help.

I am doing a regression analysis on a panel dataset containing information on banks' balance sheets and income statements.
I want to regress Leverage on a few explanatory variables (Profits, Size, Collateral) and cluster standard errors at the bank/firm level. My code:
reg Leverage Profits Size Collateral, vce(cluster Bank_Name)

Further, the regression should account for country & time fixed effects using the -xtreg- command and the fe option. I already created a numerical identifier for country:
egen identifier_country = group(country)

However, if I run xtset identifier_country year, I get the error message "repeated time values within panel". This is because the dataset contains multiple banks in the same year in the same country, and Stata sees these as repeated values.

My question: How can I overcome this problem? Any suggestions?

Thank you!

Marius


Merging Variables Measured on Same Levels in Same Dataset

Hi All,

I apologize because this likely has a very obvious answer, but I'm a Stata beginner struggling to find a straightforward solution online. I have a dataset in which I want to combine counts of a categorical variable. A group of college students each chose 7 courses from the same course list, but each separate variable records the observations by which section they chose. I want to combine these variables to get the total count for each course, not just the number of students who chose the course in each "block". Each course is represented in each variable. How would I do this in Stata? Thanks!