Random (clustered) sampling without replacement keeping two strata population proportions

Dear Statalist,

My first post on this site, so please bear with me through any 'mild' transgressions. I've been using Statalist for quite some time; it is a great resource that has solved most of the problems and questions I have encountered. However, I could not find an existing thread that addresses my current dilemma.

Long story short, I want to create a random replica/miniature version of my population for testing some model fitting. My uniquely identifying observations are firm-year rows (panel data) covering 1998-2017. When sampling, I want to maintain the proportions of two strata: first, the occurrence of defaults (binary variable "Def_1y", = 0 for non-default and = 1 for default); second, the share of observations in each year (variable "ser_year"). E.g., if 1% of my firm-year observations are defaults (the remaining 99% being non-defaults) and my year = 2017 data make up 20% of the firm-year observations, then a random sample should preserve those characteristics.

The firms are observed on several occasions, so when I draw a, say, 60% random sample of the population, I want to draw without replacement and ensure that I draw all of the observations for each selected firm (i.e., clustering on the firm). E.g., if a firm (variable "orgnr", an organizational number) did not default in 1998 and 1999 but defaults in 2000, I want to make sure that if this particular firm is selected, all of its observations are included, all the while keeping the default and yearly proportions.

I hope this makes sense. An extract of my relevant data (in order: "orgnr", "ser_year", "Def_1y") is below:


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double orgnr float(ser_year Def_1y)
5560001538 1998 0
5560001538 1999 0
5560001538 2000 0
5560001991 2002 0
5560001991 2003 0
5560001991 2004 0
5560001991 2005 0
5560001991 2006 0
5560001991 2007 0
5560001991 2008 0
5560001991 2009 0
5560001991 2010 0
5560001991 2011 0
5560001991 2012 0
5560001991 2013 0
5560001991 2014 0
5560002296 1998 0
5560002296 1999 0
5560002296 2000 0
5560002296 2001 0
5560002296 2002 0
5560002296 2003 0
5560002296 2004 0
5560002296 2005 0
5560002296 2006 0
5560002296 2007 0
5560002296 2008 0
5560002296 2009 0
5560003682 1999 0
5560003682 2000 0
5560003682 2001 0
5560003682 2002 0
5560003682 2003 0
5560003682 2004 0
5560003682 2005 0
5560008293 2007 0
5560008293 2008 0
5560008855 1998 0
5560008855 1999 0
5560010554 2004 0
5560010554 2005 0
5560010554 2006 0
5560010554 2007 1
5560010554 2008 0
5560010554 2009 0
5560010554 2010 0
5560010554 2011 0
5560010554 2012 0
5560010554 2013 0
5560010554 2014 0
end



I have been trying to achieve this using -gsample- in Stata 16 (see below), but it has not produced the results I am looking for.
(i) When specifying two strata, it almost haphazardly fails to account for clustering (as far as I can tell); i.e., a firm may form part of the sample in one year but not the next, which is not desirable. (ii) When specifying just one stratum instead ("Def_1y"), the code below accounts for clustering almost perfectly (some firm-years are still not included when their firm is sampled, but I suppose it can never be perfect given the simultaneous constraint of keeping the default proportions). However, the distribution of yearly observations then does not mimic the population distribution, which is also not desirable.


Code:
* attempt (i): two strata -- clustering is not respected
gsample 60, percent wor strata(Def_1y ser_year) cluster(orgnr) keep generate(sample60)
* attempt (ii), run as an alternative (not after (i), since sample60 would already exist)
gsample 60, percent wor strata(Def_1y) cluster(orgnr) keep generate(sample60)



Do I have to break the sampling up into multiple stages? It may take a complex solution to keep clustered firm sampling while maintaining the (rare) default and year proportions.
Again, I am simply looking to create a roughly 60% random sample, drawn without replacement, that mimics the population proportions of defaults ("Def_1y") and yearly observations ("ser_year"), clustered on firm so that if one firm-year is selected, all other firm-year observations for that firm are selected as well.
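
One possible direction, sketched by the editor rather than taken from the thread (assumes each firm can be assigned a single firm-level stratum): sample whole firms within an "ever defaulted" stratum, then pull in every firm-year row for the sampled firms.

Code:
* stage 1: draw 60% of firms, without replacement, within a firm-level stratum
bysort orgnr: egen byte ever_def = max(Def_1y)   // firm ever defaults: 0/1
preserve
bysort orgnr: keep if _n == 1                    // one row per firm (cluster)
set seed 20200131
sample 60, by(ever_def)                          // 60% of firms per stratum
keep orgnr
tempfile firms
save `firms'
restore
* stage 2: keep all firm-year observations for the sampled firms
merge m:1 orgnr using `firms', keep(match) nogenerate

Clustering is then exact by construction; the default and year proportions are matched only approximately, because whole firms arrive with all of their years.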

Any advice would be extremely helpful and appreciated,

Best,
John-Edward

new reshape issue

Sorry, this should be straightforward, but I can't figure it out even after experimenting and consulting the manual. I was able to collapse the data I asked about before, but I made a mistake for one country that has two different values in the data set (and in the one I want to merge with it): Italy is Italy/Sardinia for observation number 90 and Italy for observation 206. I need to replace the observations with a score of 0 in observation 90 with the values of the variables for observation 216. I tried a replace in a loop that included:

Code:
replace gwno1901[_216] if *[_90]==0

but got:

weights not allowed
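
For what it is worth, the error arises because square brackets right after a variable name in a -replace- statement are parsed as weights; subscripts are only legal inside expressions. A minimal sketch of one way to write the intended replacement (the variable list is illustrative and can be widened):

Code:
foreach v of varlist gwno1901 {
    replace `v' = `v'[216] if _n == 90 & `v' == 0
}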

Is ~ a valid character in a variable name?

-import delimited- and -insheet- create variable names containing a tilde when a name in the first line of the data file is too long. But I find that the -generate- command won't allow such names on the right-hand side. I can -rename-, but I would prefer to avoid the extra work. For example, here I can -describe- or -rename- the variable, but -generate- reports an "invalid name":

Code:
import delimited using /tmp/Links_2007.txt
(18 vars, 99,999 obs)

. des shareholderbv~d

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
shareholderbv~d str19   %19s                  Shareholder BvD ID

. gen x=shareholderbv~d
shareholderbv~d invalid name
r(198);

. rename shareholderbv~d x0

. des x
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
x               str19   %19s                  Shareholder BvD ID
Am I overlooking something very simple? Any suggestions?

Daniel Feenberg
NBER
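
A minimal workaround sketch (assuming the abbreviation is unambiguous): -unab- expands the ~ abbreviation to the full stored name, which can then be used in the expression, where abbreviations are not accepted.

Code:
unab full : shareholderbv~d   // expands to the variable's full name
gen x = `full'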

convert concatenated strings into numeric

Dear Stata users,

I have data like the example below, in which the researchers entered the variables as letters. Now I want to convert those strings into numbers such that "A" becomes "1", "B" becomes "2", "C" becomes "3", and so on. This is easy when a string holds only one letter, but how can I handle strings with several concatenated letters, such as "A,B,C"? Thank you in advance for your advice.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str20 x1 str22 x2 str18 x3
"A"     "A"       "C"      
"A,B"   "A,C"     "B,D"    
"B,C"   "C,G"     "A"      
"A,B,C" "C"       "B,C,D,E"
"B"     "B"       "B"      
"C"     "C"       "A"      
"A,B"   "A,C"     "B,D"    
"B"     "E"       "E"      
"A"     "B,C,D,F" "A"      
"A,B"   "B"       "A,B,C,E"
"B"     "A,F,G"   "B,C,E"  
end
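
A minimal sketch of one interpretation, keeping the result as a delimited string (if separate numeric variables are wanted instead, -split- followed by -destring- would be the route): each letter is replaced by its alphabet position.

Code:
foreach v of varlist x1 x2 x3 {
    gen `v'_num = `v'
    forvalues i = 1/26 {
        * char(64 + i) is the i-th capital letter; "A" = char(65)
        replace `v'_num = subinstr(`v'_num, char(64 + `i'), string(`i'), .)
    }
}

With this, "A,B,C" becomes "1,2,3" while single letters map directly.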

Access to Stata

Hello!
How can I export data from Microsoft Access to Stata while keeping both the codes and the labels?
Thanks
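
A minimal sketch via ODBC (the DSN and table name are placeholders, and this is an assumption about the setup rather than a tested recipe; Access lookup codes arrive as stored values, and value labels generally have to be rebuilt in Stata with -label define-):

Code:
odbc load, table("MyTable") dsn("MS Access Database") clear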

PVARSOC error(2001)

Dear all,

I'm using Stata 15 to estimate a PVAR model on 88 observations from 8 countries. I would like to get the optimal number of lags using the command proposed by Michael Abrigo. However, Stata gives me this answer: "Cannot have fewer observations than parameters" (r(2001)). Please, I need help to continue my estimations.

My command is:

Code:
pvarsoc TID INF IIF lCE PIBT, maxlag(3) pvaropts(instl(1/3))
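
One diagnostic sketch (an editor's guess, not a confirmed fix): with only 88 observations across 8 countries, three lags of instruments for five variables can easily leave fewer observations than parameters, so a smaller lag and instrument set is worth trying first.

Code:
pvarsoc TID INF IIF lCE PIBT, maxlag(1) pvaropts(instl(1/2))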

Thank you

decision on whether to weight regression model

I am studying the impact of a particular policy on the number of hospitalizations of one-year-old children. My data are collapsed by municipality and an indicator variable for the 12 months before vs. after the policy took effect. My question is whether I should weight the regression by the number of one-year-old children. I understand that if I were studying the effect on the hospitalization rate, I would do so in order to get a representative population effect. However, I am not sure whether I should when my dependent variable is an absolute count: larger municipalities will tend to have more hospitalizations, so this representativeness is arguably already built in by construction.

how to deal with duplicates when creating proportions

Dear all,
I need your valuable help and advice in the following please:
I have the following data, in which each subject ID may have several visit dates and visit types (screening, sampling, or repeat sampling), and the disease stage may be the same or different at each visit.
I want to answer the following question: how many subjects had disease stage 1 at their first visit?
(Basically, I want Stata to consider only the first visit when calculating the proportion in a given disease category, without my having to drop the other observations.)
Subject-id  Visit-date  Visit-name    Disease-stage
001         01/01/2017  screening     1
001         10/02/2018  sampling      1
001         01/01/2019  screening     1
001         05/06/2019  rep sampling  1
002         13/11/2016  screening     4
002         20/12/2016  sampling      4
003         09/04/206   screening     3
003         10/04/2016  sampling      4
004         11/05/2019  screening     2
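
A minimal sketch (variable names legalized from the table above; assumes visit_date is a Stata date so that sorting is chronological): tag each subject's earliest visit, then tabulate stage over first visits only.

Code:
bysort subject_id (visit_date): gen byte first_visit = (_n == 1)
tab disease_stage if first_visit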


TIA

Panel regression - including fixed effects AND clustering standard errors

Hi everyone,

I am running a panel regression on industry returns (49 Fama-French industries) over time (November 1994 through December 2014). The (cross-sectional) industry identifier is industrynr and the time variable is Observation. I was trying to run this with the -reghdfe- command, but I got an error message (see attachment) and can't resolve it for some reason.

Does anyone know how to resolve this?
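
Without the attachment, the exact error is unknown; for reference, a minimal -reghdfe- sketch for this kind of setup (the outcome and regressors are placeholders; -reghdfe- is a community-contributed command that also requires -ftools-):

Code:
* ssc install reghdfe
* ssc install ftools
reghdfe ret x1 x2, absorb(industrynr Observation) vce(cluster industrynr)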

Thanks in advance,

Daniel

i.year vs year*

Dear all,
My aim is to estimate a model with country and year fixed effects. I have both a variable holding the years in long format and a series of dummy variables, one per year.
My concern arises because I get different estimates and standard errors depending on whether I use
HTML Code:
Year*
or
HTML Code:
i.Year
in my model.

The model looks something like this
HTML Code:
xtreg y x1 x2 x3 i.Year, fe robust cluster(id)
However, if I estimate the model thus:
HTML Code:
xtreg y x1 x2 x3 Year*, fe robust cluster(id)
the estimates change and, in general, appear more significant. Does anyone know why this is the case? I thought these two syntaxes were synonymous. Which estimates are more reliable if my aim is to account for unobserved time heterogeneity?
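
One quick check (a guess, since Year* is a wildcard): if the continuous Year variable shares its stem with the dummies, the wildcard sweeps the continuous variable into the model as well, which would change the estimates. Listing the expansion shows what was actually included:

Code:
describe Year*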

Thanks in advance for your help

Error running a kmeans cluster analysis - error message: factor variables and time-series operators not allowed

Dear Members,

I am trying to learn how to perform a cluster analysis. I wish to apply it to the level of agreement with a set of statements, measured on a Likert scale that goes from 1 to 4.

I have 20 variables each indicating how a certain feature of electric cars is perceived as a barrier to their purchase. These 20 variables can take only the following values: 1, 2, 3, 4. With respect to the proposed statement, 1 indicates that the individual completely disagrees with it, 2 that she partially disagrees, 3 that she partially agrees, and 4 that she totally agrees. They are stored in my database in the following fashion (the image indicates one of the 20 variables).
[screenshot of the data layout for one of the 20 variables]


My idea is to run a cluster analysis and check whether I can group individuals into meaningful groups.

I tried to run the following command

Code:
cluster k planning anxiety k(3)
but I got the following error message:

factor variables and time-series operators not allowed
r(101);

I am stuck at this point and I got no results.
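
One thing worth checking (an editor's note): -cluster kmeans- takes its options after a comma, so without the comma k(3) is read as part of the varlist, which is consistent with the r(101) above. A minimal sketch of the documented syntax:

Code:
cluster kmeans planning anxiety, k(3) name(barriers3)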

This is probably a trivial issue, and I do apologise if that is the case. Looking through this forum and online, I was not able to find a solution to this problem.

I would be very grateful if any of you could provide me with an insight.

Marco

Why is the output from Graph different on two different computers?

I have encountered a very strange issue and wanted to get the forum's thoughts on what the problem might be and how to fix it. Over the weekend, on my personal copy of Stata on my laptop, I wrote code to produce various bar graphs. Today I copied the working directory to my office workstation to continue working on the code. When I casually reran the do-file on my workstation to pick up where I left off, I noticed that the background grid on the bar graphs now has giant gaps in it! I've attached example output to show what I mean:

Laptop output:
[attached bar graph, grid intact]

Workstation output:
[attached bar graph with gaps in the background grid]

Here is the code that produces the figure (the same code was executed on both machines from identical datasets):
Code:
qui sum frac_alldwnlds
graph bar ///
    frac_alldwnlds, over(ep, sort(frac_alldwnlds) desc label(alt tick)) ///
    title("All Downloads of Kids Considered") ///
    subtitle("Each episode's share of 6,833 downloads") ///
    ytitle("Percent (%)") ylabel(0(2)10, glp(dot) glc(grey) angle(horizontal)) ///
    yline(`r(mean)', lpattern(dash)) ///
    lintensity(*-255) name(frac_alldwnlds, replace)  graphr(c(white))
graph export "$figs\frac_alldwnlds.png", as(png) width(1200) replace
The images I have attached are in PNG format, but the graphs look exactly the same inside Stata. The only difference is that my laptop has Stata 15.1 and my workstation has Stata/SE 14.2. Could that produce such different output?

Simpler way to count observations of a variable?

Hi,

I have data similar to the example below, looking at the products sold in stores over 3 years:
Store Name   Product        Year  Store_Product
Tesco        Biscuits       2010  1
Tesco        Biscuits       2011  1
Tesco        Biscuits       2012  1
Tesco        Water Bottles  2010  2
Tesco        Water Bottles  2011  2
Tesco        Cakes          2010  3
Tesco        Cakes          2011  3
Asda         Biscuits       2010  4
Asda         Biscuits       2011  4
Asda         Water Bottles  2010  5
Asda         Cakes          2010  6
Asda         Cakes          2011  6
Asda         Cakes          2012  6
Sainsburys   Water Bottles  2010  7
Sainsburys   Water Bottles  2011  7
Sainsburys   Water Bottles  2012  7
Sainsburys   Cakes          2010  8
Sainsburys   Cakes          2011  8
Sainsburys   Cakes          2012  8
Morrisons    Biscuits       2011  9
Morrisons    Biscuits       2011  9
Morrisons    Water Bottles  2010  10
Morrisons    Cakes          2011  11

Store_Product is a new variable I have made that combines the store name and product and gives each combination a unique identifier. For example, Biscuits in Tesco is one combination, Biscuits in Asda is a different combination, and so on.

I want to see how many times each combination is observed across years. For example, combination 1 is observed 3 times (Tesco Biscuits in 2010, 2011 and 2012), while combination 2 is observed twice (Tesco Water Bottles in 2010 and 2011). I was able to obtain this via "tab product year", but that produces a long table, and I have been adding up the totals manually in Excel. Is there a simpler way to get these counts in Stata?
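
A minimal sketch (lower-cased variable names are assumptions): count the rows per combination directly, then list one line per combination.

Code:
bysort store_product: gen n_obs = _N     // times each combination appears
egen byte tag = tag(store_product)       // flags one row per combination
list store_name product n_obs if tag, noobs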

Thank you!

Help with Looping

Good day, Statalist,

Please, I am carrying out an analysis of 5 waves of survey data, so I want to replicate the commands for the different waves on my appended dataset, which will require looping.
I also want to obtain the result of a particular command for the different quintiles within each wave.

Please, I need help on how best to go about it.
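
A minimal sketch of the usual pattern (the wave and quintile variable names, and the command inside the loop, are placeholders):

Code:
forvalues w = 1/5 {
    forvalues q = 1/5 {
        display as text "wave `w', quintile `q'"
        summarize outcome if wave == `w' & quintile == `q'
    }
}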

Thanks

Trying to run regression discontinuity with rdrobust: getting an error. How to solve it?

I have a dataset with electoral information from Brazilian municipalities. Two of the variables are incumb_vs (the incumbent's vote share, the dependent variable) and margin_victory_08_16 (the margin of victory, the running variable). I'm trying to run a very simple RD model: rdrobust incumb_vs margin_victory_08_16. However, I'm getting the following error message: "c() should be set within the range of margin_victory_08_16". I don't know what the message means or how to get past it. Could someone help? Thanks.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(incumb_vs margin_victory_08_16)
 .4463837   .1072326
        .  .06513521
 .2757941   .4484118
 .7278874   .4694826
        .   .1651098
 .4189983   .1620034
 .3696054 .018803954
        1          1
        . .002683848
        .   .2456478
 .5187292  .03745845
        .  .15589768
        .   .5348935
        .  .19441482
  .647213  .29442596
        .  .09627774
.56946826  .13893652
        .  .14316547
 .9585735   .9171469
        .  .13237998
        .   .1262584
 .3655428    .107598
 .4486495  .10270107
        .  .06396064
        .    .255861
.52742016  .05484033
        .   .3577933
.19271736  .09784493
.05049327  .04607087
        .  .11850062
        .  .11329326
 .3453006  .14553154
        .   .5612041
  .626775  .25354996
        .   .4624514
 .6529631   .3475192
        .  .06402877
 .4378324  .11344722
        .  .20089284
        .  .04649273
        .  .02495596
        .   .2476136
 .0914439   .4831984
 .7541108   .5082216
        .  .03328356
 .4611361  .13532057
        . .017568856
 .4515343 .019608885
.59505874   .4103114
        .  .05767663
.21464506 .009585947
        .   .0386377
 .3341631 .070450604
 .4657904  .06841931
        .  .02782464
        .   .3176471
 .4657154  .06856918
        .    .304785
        .  .42265385
        . .016479075
        .  .04418364
        .  .28573948
        .  .03994054
        .  .11620066
 .5002291  .08930752
        .  .11472383
        .   .2721927
 .3734521  .16268027
 .3643772   .1526692
 .3518482  .03570038
        .    .449583
        .  .52526426
 .7604086  .52081716
        .  .22312585
.38770685   .2245863
 .6590587   .3181174
        .  .01237905
        .   .3734111
 .3984553  .20308948
        .  .18501386
        .  .24438795
        .    .231325
.12899198  .08105856
        .  .03966752
        .  .17668983
        .  .05595085
 .3665719  .26685628
        .   .3366337
        .  .12625697
        .   .3997429
        . .066535145
.52088904  .04177809
        . .067629874
 .2427203   .1150383
.24732536 .003225297
        . .013636038
 .7063679   .4127358
        .  .18246344
 .4816661 .036667824
.16857342    .571562
end
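
For reference (an editor's note, not a verified diagnosis): -rdrobust- defaults to a cutoff of c(0), and the margins above are all non-negative, so the default cutoff lies outside the running variable's range, which is what the message says. A sketch of the syntax with an explicit cutoff follows; the value is purely illustrative, and substantively the margin would likely need to be signed (negative for losers) so that 0 is a meaningful threshold.

Code:
rdrobust incumb_vs margin_victory_08_16, c(.1)   // illustrative cutoff only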

Averaging IV parameters across samples

Hi statalist,

I am looking into ways to average IV estimates across different samples (and to test their significance). To simplify my problem, let's say I have a variable X taking two values (1 and 2). I run an IV regression for each value of X and would like to calculate the average effect. I thought I could do this:

ivreg Y (D=Z*X Z) X

with D the endogenous variable and Z the instrument and Y the dependent variable.

As it turns out, when I stack both IV regressions to compare the estimates, as advised here:
https://www.stata.com/statalist/arch.../msg01493.html

I do not get the same results, though they are very close. Here is an example with Z = tenure, D = hours, X = south:


Code:
est clear
sysuse nlsw88, clear
* estimation without interaction
ivreg wage (hours=tenure ) if south==1
est sto south1
sca n1=e(N)
ivreg wage (hours=tenure ) if south==0
est sto south0
sca n0=e(N)

preserve
expand 2
bys idcode: g n=_n-1
keep if (n==0&south==0)|(n==1&south==1)

forval k=0/1 {
foreach j in tenure hours south {
g `j'`k'=`j'*(n==`k' | south==`k')
}
}

ivreg wage (hours?=tenure?) n, cl(idcode)
lincom n0/(n1+n0)*_b[hours0]+n1/(n1+n0)*_b[hours1]
// gives an average effect of .5742095


est sto stacked
restore
esttab south1 south0 stacked, nogaps mti
gen inter=tenure*south
xi:ivreg wage (hours=tenure inter) south, cl(idcode)
// gives .5746572

It is very close but not quite the same. Does it make sense to you that the IV regression with an interaction term should give the same results as the stacked IV regression? If yes, why don't I get the same results?

Thanks for your inputs



Stacked bars with percentages

[image: stacked bar chart of success/failure/unknown percentages, one panel per treatment]



The image shows a graph that should allow visual comparison of the percentage of success across 3 treatments. I used the menu system and ended up with this graph, although I want the bars side by side in one frame. However, as soon as I change anything, I end up with unstacked bars and/or the y axis switches to showing the number of observations instead of the percentage. How can I get the three stacks of bars, each summing to 100%, shown side by side in one graph?

My current code is:

Code:
graph bar, over(kolRes) asyvars stack by(, title("Success, by Treatment")) ytitle("Percentage") legend(order(1 "Success" 2 "Failure" 3 "Unknown")) by(Treatment_type)

Bonus question: how do I get rid of the footnote "Graphs by ..."? I have tried setting its opacity to 0%, but it still comes through in the .wmf file I export to PowerPoint.
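
On the bonus question, a minimal sketch (same command, consolidated into a single by() whose note("") suboption suppresses the "Graphs by ..." footnote):

Code:
graph bar, over(kolRes) asyvars stack ytitle("Percentage") ///
    legend(order(1 "Success" 2 "Failure" 3 "Unknown")) ///
    by(Treatment_type, title("Success, by Treatment") note(""))

Moving Treatment_type into a second over() instead of by() would put all stacks in one frame, but the default percentages are then shares of the whole sample rather than of each treatment, so the stacks would no longer sum to 100; precomputing within-treatment percentages (or -catplot- from SSC) is one way around that.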

Thanks, Hans

Extracting specific text from large string entry

I am seeking help extracting specific data from a large string variable (strL). I have a list of names and their institutional affiliations separated by semicolons, and I want the names of individuals from a particular institution without having to search for them manually. I am attaching a sample in Excel. I have used the "parse" command, but it separates the entries by delimiter and creates new columns; sometimes over 30 of them, which is very hard to track manually.
Would be very grateful for any help. Thanks in advance
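
A minimal sketch (the variable and institution names are placeholders): rather than tracking 30-odd columns, go long, so each name-affiliation piece becomes its own row that can be filtered.

Code:
gen long row = _n
split affiliations, parse(";") generate(piece)
reshape long piece, i(row) j(seq)
drop if missing(piece)
keep if strpos(piece, "Target University")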

Add non-positive constraints to regression

I am trying to constrain several regression coefficients to be non-positive (generally negative). How can I do this with either the -nl- or -constraint- commands? I can find plenty of advice on specifying non-negative constraints, but not the opposite.

For reference, here is what I have tried so far, which returns an error:

Code:
constraint 1 inj1 < 0
cnsreg hrql inj1, constraint(1)
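
A minimal sketch of one standard workaround (an assumption about intent, not a tested model): -cnsreg- handles only linear equality constraints, so an inequality calls for a reparameterization. With -nl-, writing the slope as minus an exponential keeps it strictly negative:

Code:
nl (hrql = {b0} - exp({lnb1}) * inj1)
* the implied coefficient on inj1 is -exp(_b[/lnb1]) < 0

One caveat of this design: the slope can approach but never equal zero, so a boundary solution shows up as {lnb1} diverging to a large negative value.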

interpreting Stata interaction terms

I am hoping to confirm my interpretation and application of the interaction terms Stata provides when we run a regression in the var1##var2##var3 format.
My regression command is

Code:
xtreg ff D1event##D2style##D3rating

where ff = fund flows. My coefficients look like this:
VARIABLES                        Coefficient
D1event = 1                       -9
D2style = 1                       -3
1.D1event#1.D2style               -7
D3rating = 1                       5
1.D1event#1.D3rating              -1
1.D2style#1.D3rating             -14
1.D1event#1.D2style#1.D3rating    15
Constant                          22
I am hoping to confirm that my interpretation and application of the coefficients are correct. My purposes are to:
1. Estimate the unique ff (i.e., fund flows) associated with a D2style fund that has a D3rating in a D1event.
Given the outputs above, am I correct to interpret the 1.D1event#1.D2style#1.D3rating coefficient as indicating that 15 is estimated to occur in a D1event (compared with a non-D1event) for a D2style fund (compared with a non-D2style fund) if the fund has a D3rating (compared with not having a D3rating)?
2. Explain how the interaction components fit together. What is the estimated total ff (i.e., fund flow) for a D2style fund with a D3rating in a D1event?
Is it the sum of all the coefficients, i.e., Constant + D1event + D2style + 1.D1event#1.D2style + D3rating + 1.D1event#1.D3rating + 1.D2style#1.D3rating + 1.D1event#1.D2style#1.D3rating?
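
On question 2, a minimal check (run after refitting the model): yes, when every dummy equals 1, the linear prediction is the sum of all eight terms, and -margins- computes that total directly.

Code:
margins, at(D1event=1 D2style=1 D3rating=1)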
Thank you for your help, Dan