Channel: Statalist

Difference in Differences (DiD) setting to test mediation

Hi,

The short version of this question is whether it is feasible to test mediation by adding triple interaction terms with continuous mediator variables.

I'm in the middle of designing a "channel study" that intends to decompose the treatment effect I found in my last project into direct and indirect effects. In that project, I applied a standard DiD approach with i.POST##i.Treatment.

Before conducting a standard path analysis/SEM, I'm wondering whether I can reuse my previous work and extend/modify the DiD setting to test the effects of hypothesized mediators. The general model I have in mind is something like:

Code:
reg Y i.POST##i.Treatment i.POST##i.Treatment##c.Mediator1 i.POST##i.Treatment##c.Mediator2 ... i.IndFixedEffect i.YearFixedEffect

The estimated coefficients on the triple interactions (POST x Treatment x continuous mediator: X1, X2, ...) could then be interpreted as the extent to which the treatment affects Y through each mediator.
I cannot find any reference specific to this approach; the closest is Acemoglu, Autor and Lyle (2004), who interpret an interaction with a continuous variable as "intensity of treatment", which is obviously not the same as what I have in mind. There are various concerns, such as dilution of the main/direct effect, terms offsetting each other, and so on. My gut feeling is that it is very difficult, if not impossible, to collapse what is essentially a path analysis into one regression and still obtain a set of comparable effects. However, I cannot convince myself that the model above won't work, to any extent, either.
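
For concreteness, a minimal sketch of an informal check sometimes used in channel studies (an illustration using the names from the model above, not a formal mediation test): compare the POST x Treatment coefficient with and without the mediator interactions and see how much it moves.

Code:
* baseline DiD
regress Y i.POST##i.Treatment i.IndFixedEffect i.YearFixedEffect
estimates store base

* DiD with one mediator's interactions added (## includes all lower-order terms)
regress Y i.POST##i.Treatment##c.Mediator1 i.IndFixedEffect i.YearFixedEffect
estimates store med

* how much does the DiD coefficient move once the channel is controlled for?
estimates table base med, keep(1.POST#1.Treatment) b se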

Can anyone help me determine whether this would work, and what the coefficients on the triple interactions would mean?

Thanks in advance,

Will

Truncated rows

Hi everyone,
I am sure this has been covered somewhere, but I can't seem to find it. I am working with a dataset that has quite long value labels. The data are generally coded 0/1 etc., but when I run analyses I lose parts of the labels, which makes interpretation a tad harder as the labels are very similar at the beginning (see the example below).

The command I used for this was

Code:
oneway OverallPPOS NEWLICTYPE if includePPOS==1, tab

Code:
  Type of   |       Summary of OverallPPOS
  placement |        Mean   Std. Dev.       Freq.
------------+------------------------------------
  Block rot |   4.6543969   .46897045         165
  Catergory |   4.4682872   .52847999         109
  Category  |   4.6636636   .41328069          74
  Category  |   4.5273376   .49331265         271
------------+------------------------------------
      Total |   4.5671056   .48911722         619

Now, rows 2-4 under "Type of placement" should read Category A, B, and C.
How do I get Stata to widen that column so that I can see the whole label in the generated results? (I suspect it is something small after tab, but I am not sure what.)
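
Not a direct fix for oneway's narrow label column, but one workaround sketch (same variables as above) that produces the same summary statistics with more room for the labels:

Code:
tabulate NEWLICTYPE if includePPOS==1, summarize(OverallPPOS)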
Thanks so much
Zelda

Trouble importing very large JSON file

I'm trying to import several large JSON files (2.5 to 4 GB each) into Stata. The files are named 2008.json to 2012.json, one for each year. (They contain patent application information, downloaded via the 'download entire data set' option at https://ped.uspto.gov/peds/.)

I don't need every field in each of these files, but the file structures seem relatively complex. I initially thought I'd use insheetjson, but two problems arise. First, I tried -insheetjson using 2009.json, showresponse- and Stata returned the following error:

Code:
fread():   691  I/O error
libjson::getrawcontents():     -  function returned error [17]
injson_sheet():     -  function returned error
<istmt>:     -  function returned error
Second, one of the features of my dataset is that, once it's flattened, there is more than one item of the same name (that I want to use). For instance, there is a field called "value" within the node "applicationNumberText", and also a field called "value" within the node "groupArtUnitNumber". I wasn't sure how to handle this within insheetjson.

I also tried William Buchanan's jsonio package. Specifically, I tried -jsonio kv, file("2009.json") nourl-. But that returned a long error that begins with this text:

Code:
java.lang.OutOfMemoryError: Java heap space
        at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
        at java.util.HashMap.putVal(HashMap.java:630)
        at java.util.HashMap.put(HashMap.java:611)
        at com.fasterxml.jackson.databind.node.ObjectNode.replace(ObjectNode.java:397)
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:250)
I'm aware that JSON files are basically text and can be parsed using regular expressions, but I'm quite a novice with regex and not sure where to begin, especially given the note above about more than one item with the same name ("value"); see also the sketch after the field list. I've attached a sample with a few records (the extension is .txt, but you can replace that with .json if it makes you happy), and I'm hoping that someone can offer some suggestions. The fields I'd like to pull off are:
"applicationNumberText":{"value"
"applicationNumberText":{"electronicText"
"filingDate"
"applicationTypeCategory"
"groupArtUnitNumber":{"value"
"groupArtUnitNumber":{"electronicText"
"nationalClass"
"nationalSubclass"
"publicationNumber"
"publicationDate"
"patentNumber"
"grantDate"

And I believe these are all 1:1 within a record (a patentRecordBag). This is the case with the sample data, though I admit I'm not certain about the full files. (I'm also not sure how to find out: I was able to discern elements of the object structure using the online JSON formatter at jsonformatter.curiousconcept.com, but I only fed it my sample records, not the full multi-GB files. I assume that insheetjson, showresponse or jsonio kv would help, but those didn't work in this case.)
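
Since the files are plain text, one fallback is to stream them and pull fields out by regex. A minimal sketch for a single field, filingDate (the output file name is mine, and I am assuming the file contains line breaks so each chunk fits in a macro; a multi-GB file stored as one long line would need a different reader):

Code:
tempname fh out
file open `fh' using "2009.json", read text
file open `out' using "filingdates.txt", write replace
file read `fh' line
while r(eof) == 0 {
    * capture the quoted value following "filingDate":
    if ustrregexm(`"`macval(line)'"', `""filingDate"\s*:\s*"([^"]+)""') {
        file write `out' (ustrregexs(1)) _n
    }
    file read `fh' line
}
file close `fh'
file close `out'

For the duplicated "value" fields, the pattern would need to include the parent node, e.g. matching "applicationNumberText" and then its "value" in one expression.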

Any help is very much appreciated – thanks!

Double hurdle model or any alternative for the below data set

Dear all,

I am trying to find a suitable model for a data set having two decisions:

1. Decision to participate (=1 if Yes and 0 otherwise) - Binary
2. If yes, then discrete choice options (5) - ordered/multinomial

Can a double hurdle model be fitted to these data? It would be great if anyone in the forum could suggest a suitable model and the syntax for it.
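
For reference, a minimal two-part sketch (the variable names are hypothetical, and a true double hurdle estimates the two equations jointly rather than sequentially, so this is only the informal analogue):

Code:
probit participate x1 x2 x3                 // hurdle: decision to participate
ologit choice x1 x2 x3 if participate == 1  // ordered choice among the 5 options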
Looking forward.

Best,
Vikram

How do I interpret the differences in ATT after endogenous switching regression

Dear Stata Community,
Please forgive me if I am violating the rules of the community; this is a gentle reminder about a question I posted previously.
I am conducting a study to determine the impact of credit access on farm income. The credit access variable is binary (1 = access to credit, 0 = no access). To control for endogeneity arising from credit access, I applied an endogenous switching regression to a pooled sample from the savanna and transitional zones of Ghana. Since the endogenous switching regression controls for endogeneity arising from both observed and unobserved factors, I expected the average treatment effect on the treated (ATT) to be the same in the two zones. However, the ATT differs between them even after these controls: the ATT for the transitional zone is $2,156.36 while that of the savanna zone is $1,501.20. Could you kindly help me with the interpretation? I would like to understand the reason for the difference in ATT after controlling for all potential endogeneity.
Your help would be much appreciated.

Thank you.
Abdallah Alidu.

Word cloud and sentiment analysis (text mining - content analysis) in Stata

Dear Forum Members,

I need to apply content analysis (text mining) strategies in a recent project of mine. However, I've found far fewer resources for this in Stata than in, for example, R. That said, I really wish to stick with Stata as much as possible for the analysis.

With regard to the analysis of words, I'm delving into the user-written ngram, precoin and coin. I also checked out other programs, as mentioned in this Stata Meeting.

That said, I'm facing a couple of obstacles. First, there is the issue of exceeding the maximum number of words, as previously reported here. (For this, hopefully a flavour of Stata above IC will do the trick, so I decided to upgrade.)

Besides, I got the impression that, contrary to what I'm getting with R, most programs in Stata won't perform well with large chunks of text or a large sample size, which will be my scenario.

Second, I haven't yet found commands/programs for key steps of the text mining I'm eager to apply, such as sentiment analysis graphs and word cloud renditions.

On account of this situation, I wonder whether you could help with some guidance.

Thank you in advance.

Creating variable with number of non-missing values in panel data

Hi.

I have panel data of share prices for the entire Compustat universe (from 1950 until now). I want to create a variable that counts the non-missing share prices in the current and all prior years, within firm. For example, say firm A has a valid share price in 2017, a missing price in 2016, and valid prices in 2015 and 2014. The count should then be 3 in 2017, 2 in 2016, 2 in 2015, and 1 in 2014; that is, by 2017 (inclusive) the firm had 3 valid share price observations (2017, 2015 and 2014). How can I code this in Stata? A sample table of my data is below, followed by a sketch. Thanks in advance.

 N   ID   Year   Share Price   Count
 1    1   2017      0.789955       3
 2    1   2016      0.562119       2
 3    1   2015             .       1
 4    1   2014      0.055518       1
 5    2   2017      0.058472       2
 6    2   2016             .       1
 7    2   2015      0.489020       1
 8    2   2014             .       0
 9    3   2017      0.683695       4
10    3   2016      0.608817       3
11    3   2015      0.754929       2
12    3   2014      0.216465       1
13    4   2017             .       3
14    4   2016      0.144847       3
15    4   2015      0.442283       2
16    4   2014      0.681256       1
17    5   2017      0.832072       3
18    5   2016             .       2
19    5   2015      0.218486       2
20    5   2014      0.632021       1
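
For what it's worth, a minimal sketch that reproduces the Count column above (assuming the price variable is named share_price; sum() used with generate is a running sum within the by-group, so each row counts the valid prices up to and including that year):

Code:
bysort ID (Year): generate Count = sum(!missing(share_price))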

Plotting confidence intervals

Dear all,

I am running a 2SLS regression to estimate the relation between legal crop prices and coca production for a range of crops in Colombia. To test whether my results are driven by a specific region of Colombia, I also run the regression for that region alone. This is what the code for my baseline regression looks like:

Code:
foreach var of varlist cocaprod{
    xi: xtivreg2 `var' lpop ylxlrer_ban ///
        yearxMuncode* YearInd*xWkmean    ///
        (yieldxprice  = yieldxtop3export) ///
        i.year, fe cluster(dept) partial(i.year) first
        outreg2 using CI_Analysis.xls, se bdec(3) tdec(3) nocons excel
        }
Here (yieldxprice = yieldxtop3export) is my first stage and cocaprod is my outcome variable; all other variables are controls.

Now I would like to compare the confidence intervals from this regression with those from the regression that excludes the specific region (so basically comparing the confidence intervals of two regressions) in a single plot or graph in Stata. Is there a command for this?

I am using Stata 14.2
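
One possibility (a sketch, untested on these data): store both sets of estimates and plot them with Ben Jann's coefplot from SSC, which works in Stata 14.2. The if condition and labels below are placeholders:

Code:
ssc install coefplot
xtivreg2 cocaprod ... , fe cluster(dept)                 // full sample
estimates store full
xtivreg2 cocaprod ... if region != 1, fe cluster(dept)   // excluding the region
estimates store excl
coefplot (full, label("Full sample")) (excl, label("Excl. region")), ///
    keep(yieldxprice) xline(0)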

Help would be very much appreciated!

Best,

Sophie

Cause of churdle command running for hours

Hello,

I've been trying to run the churdle command below, and it runs for hours. The sample size is 125,000 observations. Is it normal for churdle to run for hours on a sample of 125,000, or is there something wrong with the data?

[use "D:\STATA\stata_crc.dta", clear]
[svyset [pweight = finlwt21]
[svy: churdle linear yhealth_care_new i.age_y35 i.age_y68 i.age_y911 i.age_y1214 i.age_y1517 if hw_kids , select (income2 income3 num_child1 num_child3) ll(0)]

Thanks in advance for your help.

Alexis

In a dataset of 2014 observations, 8 were missing, then added with multiple imputation. When I run analysis now it shows 2070 observations

Hi StataList,

I am doing research on subjective well-being, using cross-sectional survey data for two years (not panel). The integrated dataset has some missing values for my dependent variables, life satisfaction (8 missing values) and happiness (28 missing values). Even though this may not sound like a big number, I am focusing on small groups in my dataset, on each year separately, and every observation matters for the size of my sample. So I decided to proceed with multiple imputation. I followed the steps described in Mehmet Mehmetoglu and Tor Georg Jakobsen (2016), 'Applied Statistics Using Stata: A Guide for the Social Sciences', and did the process for life satisfaction first. I also compared my commands with some YouTube videos and it all looks good.

The results I am getting for the regression after the imputation are based on the entire sample size, which is 2014; I assume this means the imputation was correct and successful. However, once I save my dataset and then reopen it to run regressions, the number of observations is 2070, which exceeds the regular size of the sample. How is this possible? Where did I make a mistake? I guess I need to do something different when saving my data. Or should I always use 'mi estimate: regress ...' even after the imputation was done and the data were saved? I assume there is a way to save the data with the imputed variable as a new dataset on which I can then run different analyses without including 'mi estimate' every time.
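
A sketch of checks that may help diagnose this (the variable names are hypothetical): mi describe reports the mi style and the number of imputations. If the file was saved in flong or mlong style, the imputed values are stored as extra rows in the same file, so a plain regress will mix original and imputed rows, while mi estimate will not:

Code:
mi describe                           // style, M, registered variables
mi estimate: regress lifesat x1 x2    // combines results across imputations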


I really appreciate your time to read my post and come with any suggestions,
Best wishes,
Mirjana

Logistic Regression w/ mildly significant Dummy Variable

Hello all,

Wondering if I can get some guidance from those more informed than I.

My regression results are below. I believe they show relatively strong evidence that the independent variables have non-zero effects, correct?

My main query concerns the inclusion of the less significant dummy "SAC" variables. Specifically, sac2 and sac4.

There are 6 "sac" types in the data set, 1-6, and I created the 5 sac type dummy variables in the usual way: sac4 equals 1 if the sac type is 4 and 0 otherwise, and so on.

What considerations would one make in deciding whether it is reasonable to include sac2 and sac4 in the model? My thought at this point is that there is some evidence of significance, and an argument can be made that it would be logical for the variable to matter. Must I exclude these variables, or can it be reasonable to retain mildly insignificant dummies when others in the same set of dummy variables are significant? (See also the sketch after the output.)

Thanks for any help provided!

Code:
. logit imp csmin csmin2 tds2 etlrt2 ltv2 minage2 sac1 sac2 sac3 sac4 sac5 if funded==1

Iteration 0:  log likelihood = -411.67704
Iteration 1:  log likelihood = -383.59557
Iteration 2:  log likelihood = -370.40427
Iteration 3:  log likelihood = -368.72347
Iteration 4:  log likelihood = -368.68813
Iteration 5:  log likelihood = -368.68812

Logistic regression                             Number of obs   =      2,001
                                                LR chi2(11)     =      85.98
                                                Prob > chi2     =     0.0000
Log likelihood = -368.68812                     Pseudo R2       =     0.1044

------------------------------------------------------------------------------
         imp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       csmin |   .0323853   .0081642     3.97   0.000     .0163838    .0483868
      csmin2 |  -.0000339   7.59e-06    -4.47   0.000    -.0000488    -.000019
        tds2 |   .0230267   .0107251     2.15   0.032     .0020059    .0440475
      etlrt2 |   -3.15592   1.072801    -2.94   0.003     -5.25857   -1.053269
        ltv2 |  -3.380574   1.511915    -2.24   0.025    -6.343873   -.4172758
     minage2 |   .0002754   .0000821     3.35   0.001     .0001145    .0004363
        sac1 |  -.8622301   .4260937    -2.02   0.043    -1.697358   -.0271017
        sac2 |    -.73401   .4647683    -1.58   0.114    -1.644939    .1769192
        sac3 |  -.9593358   .4495987    -2.13   0.033    -1.840533   -.0781384
        sac4 |  -.9076598   .4818127    -1.88   0.060    -1.851995    .0366756
        sac5 |  -1.402417   .4309326    -3.25   0.001    -2.247029   -.5578043
       _cons |  -5.753695   2.571938    -2.24   0.025     -10.7946   -.7127887
------------------------------------------------------------------------------
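
One standard consideration, as a sketch: treat the sac dummies as a single categorical predictor and test them jointly after the logit above; the usual practice is to retain or drop the whole set on the basis of such a joint test (and substantive grounds), rather than term by term:

Code:
testparm sac1 sac2 sac3 sac4 sac5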

Reshape from Wide to Long w/ Time Intervals in the Rows

[Attachment: a wide-format table with one row per dyad and a peacescale value in a separate column for each YYYYMMDD time period.]


Here peacescale is the variable I want, and the time periods are in YYYYMMDD format. How can I put the above table in long format, where each row is a dyad-year?
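
A minimal sketch of the reshape, assuming the wide variables share a stub such as peacescale19900101 and the identifier is dyadid (both names are my assumptions):

Code:
reshape long peacescale, i(dyadid) j(date)
generate year = int(date/10000)   // 19900101 -> 1990, for dyad-year rows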

Thanks!

Pre/Post survey analysis

Hi all, thanks in advance for your help.
I'm a student who needs help on pre/post survey data analysis.
I have survey data on students before and after an intervention.
The responses have 5 categories: strongly agree, agree, neutral, disagree, strongly disagree, which I've coded as 1-5.

            PRE   POST
Student 1     1      4
Student 2     3      3
Student 3     3      5
...

Does anyone have any suggestions on how to test whether the intervention had a significant effect? I want to know whether the intervention worked (that is, did it help students get more interested in the field?). The intervention was a shadowing program intended to help students get more interested in a particular career field. I've done t-tests in the past, but I don't know how to do an analysis where a student's response has 5 ordered categories. Any help would be appreciated.

With thanks,
Sun

EDIT: also, how do I get a sense of the direction of the effect, i.e. whether the intervention moved students one way or the other?
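
A sketch of one common approach for paired ordinal responses (assuming the variables are named pre and post): the Wilcoxon matched-pairs signed-rank test. The counts of positive and negative differences it reports also indicate the direction of any shift:

Code:
signrank post = pre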

Formatting Stata Results

When writing one's own Stata commands, what is the best way to ensure that the output in the Results window is nicely formatted in tabular form and can be cut and pasted into a table? I've been using the list command to display results, but there must be something better.

I've also had mixed success cutting and pasting results from other commands into a table. I am not sure if the problem is me, or if some commands format their results in a friendlier way than others.
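
One pattern, as a sketch (the numbers are made up): accumulate results in a Stata matrix and display it with matlist, which takes care of column alignment and numeric formatting:

Code:
matrix stats = (4.654, 0.469, 165 \ 4.468, 0.528, 109)
matrix colnames stats = mean sd n
matrix rownames stats = groupA groupB
matlist stats, format(%9.3f)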

Skewness adjusted t-test

Dear all,

I'm required to do a skewness-adjusted t-test on stock return data. I have daily log returns for 952 firms over a period of 488 trading days.
The skewness is -1.75, found using:
Code:
sum r, detail
I installed the skewt package, which gave me the output:
Code:
. skewt tbhar
(463,624 observations deleted)

    tbhar- stats from the sample

    N coefficient  = 30.85449724108302
    S-coefficient  = -.0645222099403743
    G-coefficient  = -.4476181276901068
    Sample mean    = -.0069934897938696
but I'm unsure how to proceed.
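
For reference, and as my own reading based on matching the numbers: the statistics above look like the ingredients of the skewness-adjusted t-statistic, with the N coefficient $\sqrt{N}$, the S-coefficient $S = \bar{x}/\hat{\sigma}$, and the G-coefficient $\hat{\gamma}$ an estimate of skewness, combined as

$$t_{sa} = \sqrt{N}\left(S + \frac{\hat{\gamma} S^2}{3} + \frac{\hat{\gamma}}{6N}\right)$$

Plugging in the values above gives 30.8545 x (-.0645222 + (-.4476181)(.0645222)^2/3 + (-.4476181)/(6 x 952)) = -2.0124, which matches the observed coefficient in the hallt bootstrap output in the Edit below.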

Normally I would calculate the t-test using:
Code:
ttest tbhar==0
which gives me the mean, std. error, std. dev., t-statistic, and probability, as shown below.

Code:
ttest tbhar==0

One-sample t test

Variable      Obs        Mean    Std. Err.   Std. Dev.    [95% Conf. Interval]

tbhar      952   -.0069935    .0035129    .1083889    -.0138874   -.0000996

mean = mean(tbhar)        t =  -1.9908
Ho: mean = 0    degrees    of freedom =      951

Ha: mean < 0    Ha: mean != 0    Ha: mean > 0
Pr(T < t) = 0.0234         Pr(T > t) = 0.0468    Pr(T > t) = 0.9766
I hope someone can help me with the next step after the skewt output.

Edit:
If I run:
Code:
 hallt tbhar, bs reps(100) size(2) saving ("X:\My Documents\final bs.dta")
(basically copying the example in the help file)

I get:

Code:
. hallt tbhar, bs reps(100) size(2) saving ("X:\My Documents\final bs.dta")
(463,624 observations deleted)

    tbhar- stats from the sample

    N coefficient  = 30.85449724108302
    S-coefficient  = -.0645222099403743
    G-coefficient  = -.4476181276901068
    Sample mean    = -.0069934897938696

(running hallt on estimation sample)

Warning:  Because hallt is not an estimation command or does not set e(sample),
          bootstrap has no way to determine which observations are used in
          calculating the statistics and so assumes that all observations are
          used.  This means that no observations will be excluded from the
          resampling because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop
          the observations that are to be excluded.  Be sure that the dataset in
          memory contains only the relevant data.

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

Bootstrap results                               Number of obs     =        952
                                                Replications      =        100

      command:  hallt tbhar
        _bs_1:  r(ratio_tbhar)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |  -2.012445   1.042519    -1.93   0.054    -4.055744    .0308536
------------------------------------------------------------------------------

.
But I'm not sure how to interpret this.

Statistical significance in each part vs. joint significance in two-part models using -twopm-

Dear Statalisters,
I am using the user-written command -twopm- (Belotti, Deb, Manning and Norton, Stata Journal 15(1), 2015) for my project, and I was wondering if someone could help me with some intuition (or relevant literature) on a covariate that has opposite signs in the binary part and the positive-outcome part, yet is found to be jointly significant in both parts by the Wald test following -twopm-.

In particular, I am not clear on what the difference is between "statistically significant in each part" and "jointly significant in both parts" of the two-part model, and I wonder whether the authors are referring to the difference between the conditional/actual outcome expectation (each part) and the unconditional/potential outcome expectation (joint significance) when they discuss the Wald test. If I understand correctly, Dow and Norton (Health Services & Outcomes Research Methodology 4, 5-18, 2003) prefer "actual outcomes" and "potential outcomes" to "conditional mean" and "unconditional mean", respectively. Any clarification will be helpful and greatly appreciated.

Sincerely,
Suryadipta

Panel Logistic Regression for bankruptcy prediction

Hi, I'm using Stata to build a model to predict bankruptcies in Europe, using financial and macroeconomic information. I'm quite new to Stata, and I would like to know how to produce something like a ROC curve, and to define cut-off points, after xtlogit rather than logit.
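
A sketch of one workaround (lroc only runs after logit/logistic, but roctab accepts any predicted probability; the variable names are hypothetical, and pu0 predicts the probability with the random effect set to zero):

Code:
xtlogit bankrupt x1 x2 x3, re
predict p_hat, pu0
roctab bankrupt p_hat, graph detail   // detail lists sensitivity/specificity by cutpoint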

Thanks in advance.

Dummy if variable does not change over time in longitudinal data

Hi all,

I am working with longitudinal data where each row contains a userid, a month, and the employment status in that month. I want to create a dummy variable equal to 1 if a person "retires" - that is, if their employment status is "no job" in this month and in all following months. Basically, I want to avoid capturing temporary exits. Additionally, if they do not work during the entire panel, I want to classify them as "retired" as well.
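
A minimal sketch (assuming a monthly date variable month and an indicator nojob equal to 1 in months without a job; the names are mine): a person is retired in a month if they have no job then and never work afterwards, i.e. the month lies after their last employed month, and someone who never works has no employed month at all:

Code:
bysort userid (month): egen last_work = max(cond(nojob == 0, month, .))
generate retired = nojob == 1 & (missing(last_work) | month > last_work)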

If there is an older Statalist question which addresses this, please let me know. Thank you for your help.

How to add optimal breakpoint into regression models

Hello,

I am working on a time series regression with quarterly data from 1965 to 2016. I ran the Clemente, Montañés and Reyes tests for additive and innovative breaks. The optimal breakpoints are 1972q4 and 1982q1 for the AO (additive outlier) models, and 1972q4 and 1982q2 for the IO (innovative outlier) models.

For the additive breakpoints, I want to add a dummy variable with the value 1 in 1972q4 and 1982q1 and 0 everywhere else. I also want to add a dummy variable for the innovative breakpoint spanning the whole period 1978 to 1985, since the government policy that may have caused this break (the CAFE fuel-economy program in the USA) was in force during those years.
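
A minimal sketch, assuming the quarterly time variable is named qdate and the data are tsset (the names are mine):

Code:
generate break_ao = inlist(qdate, tq(1972q4), tq(1982q1))
generate break_io = inrange(qdate, tq(1978q1), tq(1985q4))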

Thank you for you help,
Carolann

thresholdtest takes too much time and gives no results

I am trying to identify whether the federal funds rate has a threshold point in firms' decisions to expand in size. (I am working with firm-level panel data.)

I ran thresholdtest to check whether there is a significant threshold point:

Code:
thresholdtest size ffr, q(ffr) trim_per(0.15) rep(5000)


The problem is that the test runs for a very long time (more than an hour) with no results from Stata: the command produces neither output nor an error.
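
One low-cost first check, as a sketch with the same syntax as above: bootstrap-based threshold tests scale roughly with the number of replications, so running far fewer replications should at least reveal whether the command works on these data:

Code:
thresholdtest size ffr, q(ffr) trim_per(0.15) rep(100)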

Any help?