Channel: Statalist

Identify if single cell contains multiple substrings

Hi,

I'd like to identify observations when a string variable contains multiple substrings.

In my dataset, I have a string variable containing crime descriptions. For that string variable, I want to identify when "REG" and "GUN" appear together in the same cell.

The data is messy and not standardized. Here is an example of how some of the strings look:

crime
GUN OFFENDER REGISTRATION
GUN OFFENDER-FAIL TO REGISTER
GUN OFFENDER/FAIL REG OFFENDER
GAS/AIR/PAINTBALL GUN: POSSESS
FIRING HANDGUN IN CITY LIMITS
FRAUDULENT POSSESSION OF VEH OWNERSHIP REG. PLATE
KNOWINGLY HOLDING FALSIFIED VEH. REG. PLATE

I've successfully used the strpos() function to isolate observations containing a single substring, e.g.:

l if strpos(crime, "REG")
l if strpos(crime, "GUN")

And I've been able to identify observations that contain either one substring or another, i.e. :

l if strpos(crime, "REG" "GUN")

But I haven't been able to figure out how to identify if a single cell contains both "REG" and "GUN".
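For reference, a minimal sketch: since strpos() returns a positive value on a match (and zero otherwise), the two tests can be combined with the logical & operator (variable name as in the post):

```stata
* flag cells containing both substrings
list crime if strpos(crime, "REG") & strpos(crime, "GUN")
```

Any nonzero result counts as true in Stata, so the & of the two strpos() calls is true only when both substrings appear.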

Any advice is appreciated.









marginsplot with custom labels

Hi,

I saw a discussion about this topic here: https://www.statalist.org/forums/for...-custom-labels

I tried asking the question there, but it seems nobody saw it.

I have a related question about custom labels with marginsplot.

I am trying to set up something like this to shorten the display of the labels:

marginsplot, recast(line) recastci(rarea) ytitle("Cardiovascular Risk Index") ///
xtitle("Longitudinal Obesity") xlabel (`=none' "none" `=obese@adulthd' "@adult" `obese@early-adlt' "@early-adult" `=obese@adolsct' "@adolescent")

But it did not work. Do you know how I can fix it? Also, how can I make the labels display in full rather than being cut off?
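As a hedged sketch (assuming the x variable takes the integer values 1 to 4 in that order), xlabel() expects a numeric value followed by the label text for each tick, so something along these lines might work:

```stata
marginsplot, recast(line) recastci(rarea) ytitle("Cardiovascular Risk Index") ///
    xtitle("Longitudinal Obesity") ///
    xlabel(1 "none" 2 "@adult" 3 "@early-adult" 4 "@adolescent", angle(45))
```

The angle(45) suboption is one common way to keep longer labels from being truncated.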

Of course, there is another option. I could re-define the value label. But I would like to try changing the label in the graph first.

Thanks!

Alice

Set title for the second y-axis in scatter chart with by()

Dear Statalisters,

I plot the chart as below

Code:
twoway (scatter var1 rev, sort) (scatter var2 rev, sort yaxis(2) msymbol(square_hollow)), ///
ylabel(0(100)500) by(, note("")) by(, legend(off)) by(periods) ytitle("1st title", axis(1)) ytitle("2nd-title", axis(2))
But the y-title for the second axis does not appear. I tried many options in the dialog, but nothing seems to work.

Could someone show me what is wrong with my plotting code? Note: I use Stata 15.
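One hedged guess: with by-graphs, the by-related suboptions are usually gathered inside a single by() option, so consolidating the three separate by() calls might behave differently (an untested sketch, same variables as above):

```stata
twoway (scatter var1 rev, sort) ///
       (scatter var2 rev, sort yaxis(2) msymbol(square_hollow)), ///
       ylabel(0(100)500) ytitle("1st title", axis(1)) ytitle("2nd-title", axis(2)) ///
       by(periods, note("") legend(off))
```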

Many thanks.


generate variable time to event. Several measures in time.

Please, I need to create an event variable and a time-to-event variable, but I have several measurements per subject. The variable is pcrresult and the event value is 1.


----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int id float datepcrrequest2 byte pcrresult2 float datepcrrequest3 byte pcrresult3 float datepcrrequest4 byte pcrresult4 float datepcrrequest5 byte pcrresult5
 1 20265 2     . .     . .     . .
 2 19785 2 20996 2     . .     . .
 3 18379 2 18393 2     . .     . .
 4 18582 2 19841 2     . .     . .
 5 19802 2 20170 2 20520 2 20730 2
 6 19926 2 20306 2 20488 2 20801 2
 7 18624 2 18805 2     . .     . .
 8 18568 2 18833 2 19183 2 19456 2
 9 20247 2 20394 2 20772 2     . .
10 20076 2 20436 2 20773 2     . .
11 20044 2 20401 2 20765 2 21182 2
12 19988 2     . .     . .     . .
13 18543 2 18550 2 18564 2 18578 2
14 19799 2 19981 2     . .     . .
15 20202 2 20384 2 20906 2 21252 2
end
format %tdDD/NN/CCYY datepcrrequest2
format %tdDD/NN/CCYY datepcrrequest3
format %tdDD/NN/CCYY datepcrrequest4
format %tdDD/NN/CCYY datepcrrequest5
------------------ copy up to and including the previous line ------------------

Listed 15 out of 389 observations
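A hedged sketch of one common approach: reshape to long form so that each PCR request is its own observation, then build the event indicator and the earliest event date per id (this assumes, as stated, that the event is pcrresult == 1):

```stata
* wide to long: one row per PCR request
reshape long datepcrrequest pcrresult, i(id) j(visit)
drop if missing(datepcrrequest)
* event indicator and, per id, the earliest event date
gen byte event = (pcrresult == 1)
bysort id: egen firstevent = min(cond(event, datepcrrequest, .))
format %tdDD/NN/CCYY firstevent
```

From there, time to event per id is the difference between firstevent and the chosen origin date (e.g. the first request date).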


Is it possible to destring a variable which starts with a letter?

Hello, I am using Parliamentary Constituency codes in my dataset, which begin with E, S, N, or W depending on whether the constituency is in England, Scotland, Northern Ireland, or Wales, respectively. The variable is a string. I have made several attempts to destring the variable in the usual way, but have been unsuccessful. I suspect this is because of the letter at the start of each observation. Are there any suggestions for how to overcome this? Many thanks.
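Two hedged sketches (pconcode is a placeholder for the actual variable name): destring refuses when non-numeric characters remain, so either strip the leading letter first, or use encode if arbitrary numeric codes with the original strings as value labels are acceptable:

```stata
* option 1: drop the leading country letter, then destring the digits
gen codenum = substr(pconcode, 2, .)
destring codenum, replace

* option 2: arbitrary numeric codes, original strings kept as value labels
encode pconcode, gen(pconid)
```

Note that option 1 discards the country information carried by the letter, so it may be worth keeping it in a separate variable first.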

displaying the contents of a macro

Hi all: I am using "macro list" to display the contents of all the macros I have created (locals and globals). But what command, or what option, do I use to display the contents of just one macro, or just two? I have not been able to figure this out...
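A short sketch of the usual tools: display shows a macro's contents directly, and macro list also accepts macro names, with locals written with a leading underscore:

```stata
local myloc "hello"
global myglob "world"

display "`myloc'"      // contents of the local
display "$myglob"      // contents of the global
macro list _myloc      // locals take a leading underscore in macro list
macro list myglob
```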

Bug fix for estimates table and estout/esttab in Feb 20 update of Stata 15

In the latest update of Stata 15 (from Feb 20, 2019), models with ancillary parameters can cause problems in estimates table and estout/esttab. Example:

Code:
. webuse womenwk

. qui heckman wage educ age, select(married children educ age) twostep

. est sto m1

. qui heckman wage i.educ age, select(married children i.educ age) twostep

. est sto m2

. estimates table m1 m2
/mills not found
r(111);
The problem is due to a change in command mat_capp, which is used by estimates table and estout/esttab to merge models.

Jeff Pitblado (Stata Corp) gave me the following explanation + workaround until a fixed Stata update is released:

The change [...] was intentional [...], it also fixed a problem -estimates
table- was having with -sem- results. I figured out where the problem in
-mat_capp- is and have a fix. I do not know yet when the next ado-file update
in Stata 15 will happen, but I will push for one soon.

In the mean time, I've attached the "fixed" mat_capp.ado file. Just put it in
the ado/base/ folder (but not in the m/ folder). Start up a new Stata 15
session and -estimates table- should work properly. This file will get replaced
by the officially "fixed" one in the next update.


The mentioned file is attached.

ben

Frequency bar graph for multiple binary variables

I have a string variable (Reasons_Inadmissible) with multiple values that are comma separated. It lists all of the reasons why a case is inadmissible; multiple reasons can be listed in one cell. For example, one cell will read HRM, another will read HRM, LAW, another will read LAW, etc. Other values include REP, LON, OTE and others. I am trying to produce a frequency bar graph which indicates the number of cases where Reasons_Inadmissible is HRM, the number of cases where Reasons_Inadmissible is LAW, etc. In other words, I want a bar chart to indicate the number of cases that invoked a particular reason. Individual cases can be double-counted in this bar chart. I tried:

catplot Reasons_Inadmissible

But, it produces a bar chart with all possible combinations (i.e. HRM; HRM, LAW; HRM, OTE; LAW; OTE). Instead, I want the bars to represent the number of cases where one of the reasons was listed (just HRM, LAW, REP, OTE-- individually).

One option would be to generate a series of binary variables that are equal to 1 for each Reasons_Inadmissible value. For example:
gen HRM = 1 if strpos(Reasons_Inadmissible,"HRM")

gen LAW = 1 if strpos(Reasons_Inadmissible,"LAW")

However, in that case, how do I generate a bar graph, where the bars are counts of the different binary variables? Alternatively, is there another way you would recommend generating this graph from a comma-separated categorical variable?
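A hedged sketch along those lines: true/false indicators coded 1/0 (rather than 1/missing) let graph bar sum them into counts directly. The reason list is taken from the post and may be incomplete:

```stata
foreach r in HRM LAW REP LON OTE {
    gen byte `r' = strpos(Reasons_Inadmissible, "`r'") > 0
}
* each bar = number of cases invoking that reason
* (cases with multiple reasons count toward multiple bars)
graph bar (sum) HRM LAW REP LON OTE
```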

Thanks!

Erica

data with both ICD9 and ICD10 - please help

Hi,

I have a PHC4 dataset that has both ICD9 and 10 codes merged in the same variables (for example admitting diagnosis, billing diagnoses). I've been trying to clean up the data a little and have been having difficulty doing so with the merged data. I'm ultimately trying to see the most common diagnoses for descriptive studies, but also be able to organize it better so I can run regression analyses.

for example:
the below code works
"icd10 generate admdescr = admdx, description" and created a new variable with descriptions
but
"icd9 generate admdescr = admdx, description" does not as it states there are variables that are not ICD9 codes (which is true, although the ICD10 version worked)

I thought about dividing the data into ICD-9 and ICD-10 sections to clean it up, and then, when I re-merge, just creating new variables (aki, dm, etc.) to help with the regression. I'm not sure if that's the best or most efficient method. I've tried reading the official Stata ICD help materials and they haven't helped.
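A hedged sketch for splitting the codes: icd9 check with its generate() option records, per observation, whether the value is a valid ICD-9 code (0 means no problem), which could separate the two coding systems within a single variable. This assumes admdx holds the codes, as in the post:

```stata
* 0 = valid ICD-9 code; nonzero = problem (here, most likely an ICD-10 code)
icd9 check admdx, generate(icd9prob)
gen byte isicd9 = (icd9prob == 0)

* then describe each subset with the matching command
icd9 generate admdescr9 = admdx if isicd9, description
icd10 generate admdescr10 = admdx if !isicd9, description
```

The two description variables could then be combined into one with a couple of replace statements.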

Any advice would be appreciated

local interpolation

Dear Statalisters
I was wondering if anyone has written, or knows of, a function that obtains the best interpolation based on two series where we know that one is a smooth transformation of the other.
Say, for example, that y = F(x); we do not know F, but we know it is a strictly increasing, smooth function, so that every value of x is paired with a unique value of y.
What I need now is: given that I have the data for y and x, to obtain the "correct" y for any value of x of my choice.
To be more concrete, see the example below.
Code:
sysuse auto, clear
gen lnprice=ln(price)
sum lnprice
gen Fprice=normal((lnprice-r(mean))/r(sd))
* now we know that there is a smooth and unknown function F that transforms price into Fprice.
* Now, how do I obtain the best interpolation of Fprice for, say, price=5000?
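One hedged sketch using official -ipolate-: append an observation holding the target price, then linearly interpolate Fprice over price. Linear interpolation only approximates the smooth F between observed points, but monotonicity of the fitted values is preserved:

```stata
* add a row with the target x value and interpolate y = Fprice over it
set obs `=_N + 1'
replace price = 5000 in `=_N'
sort price
ipolate Fprice price, gen(Fprice_i)
list price Fprice_i if price == 5000
```

For a smoother fit, a cubic-spline interpolator (e.g. from SSC) could be substituted for ipolate under the same append-then-interpolate pattern.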
Suggestions are appreciated.
Thank you

Interactions with time in linear mixed models (repeated data)

Hi all,

I'm working on the impact of blood pressure variability on cognitive function over time. I ran the following model:

mmse : cognitive function (MMSE test)
zcv_sbp : CV% of systolic blood pressure variability (visit-to-visit variability). Measured every 6 months.
time : from 1 to 7 (visit 1 to visit 7 every 6 months)

xi:xtmixed mmse zcv_sbp age sexe education [other confounders] time || id : time

In this model, the beta associated with zcv_sbp is -0.6, p = 0.01, so I can say that, whatever the time, per 1-SD increase in systolic blood pressure variability, cognitive performance is lower (-0.6).

I checked the interaction with time :

xi:xtmixed mmse c.zcv_sbp##c.time || ctrpat : time

The coefficient of c.zcv_sbp#c.time is -0.004 but p=0.78 so not at all significant.

I just wanted to be sure that I'm allowed to say that the negative effect of systolic blood pressure variability is the same over time. So patients with an elevated variability have lower cognitive performances but they don't have a greater cognitive decline over time compared to patients with a lower variability.

I was expecting a cognitive decline in this population because I've also done a cox model looking at incident dementia and patients with a high blood pressure variability have a higher risk of developing dementia.

It's weird not to be able to show that they have a greater cognitive decline over time.

Basically, I just want to be sure that, if the interaction as I've specified it is not significant, I can conclude that the effect of variability is the same over time. Can I do so just based on the non-significant p-value of the interaction, at 0.8?

Thank you so much +++ for your valuable help.

I'm not very familiar with linear mixed models at all...

Javier

Arellano-Bond test AR(2)

Dear Statalisters,

I have one doubt about the result of my Arellano-Bond test for autocorrelation in a gmm estimation.
The Hansen test of joint validity is OK, with a p-value bigger than 10% but not too high.
The number of instruments is also OK: I have N = 95 and T = 20, and around 20 instruments.
However, the AR(2) test has been giving me p-values on the order of 0.7 to 0.8. As far as I know, the AR test is not weakened by many instruments the way the Hansen test is, so I don't know whether these p-values are reliable or not.

I'd really appreciate some light on this point. Thanks.

. xtabond2 empshare L.empshare ln_gdppc ln_gdppc_2 ln_pop ln_pop_2 i.year2,
> gmmstyle (L.empshare, lag(1 5)collapse)
> iv(ln_gdppc ln_gdppc_2 ln_pop ln_pop_2 i.year2) orthogonal twostep robust

Arellano-Bond test for AR(1) in first differences: z = -3.18 Pr > z = 0.001
Arellano-Bond test for AR(2) in first differences: z = -0.26 Pr > z = 0.791
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(4) = 14.17 Prob > chi2 = 0.007
(Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(4) = 6.33 Prob > chi2 = 0.176
(Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
GMM instruments for levels
Hansen test excluding group: chi2(3) = 5.11 Prob > chi2 = 0.164
Difference (null H = exogenous): chi2(1) = 1.22 Prob > chi2 = 0.269

Cumulative sum by order and id

Hi all,

I am working on this dataset, in which I would like to test whether the cumulative sum of the variable 'votes_margin' for order_m = 2 and 3 is greater than order_m of 1.

1. Therefore, for each id indicated by the variable 'ac', I would like to create a cumulative sum of the variable 'votes_margin', but only over the values of order_m equal to 2 and 3.
2. Then, create a dummy indicating whether this cumulative sum over order_m values 2 and 3 is greater or less than the votes_margin for order_m of 1, for each id.

Any suggestions would be helpful.
I have provided the example dataset by using dataex, below for your kind verification.

Thanks.


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str14 state_m str30 constituency float year_m str55 candidate_m double votes_margin long ac float order_m str24 const_m
"Andhra Pradesh" "  ACHAMPET (SC)"  2004 " DR.VAMSHI KRISHNA"           .538547933101654  2 1 "Achampet" 
"Andhra Pradesh" "  ACHAMPET (SC)"  2004 " P.RAMULU"                  .36918625235557556  2 2 "Achampet" 
"Andhra Pradesh" "  ACHAMPET (SC)"  2004 " YAMGONDI VENKATAIAH"      .046903301030397415  2 3 "Achampet" 
"Andhra Pradesh" "  ACHANTA (SC)"   2004 " PEETHALA SUJATHA"           .4994221329689026  3 1 "Achanta"  
"Andhra Pradesh" "  ACHANTA (SC)"   2004 " ANAND PRAKASH CHELLEM"     .43905702233314514  3 2 "Achanta"  
"Andhra Pradesh" "  ACHANTA (SC)"   2004 " JOSHIP MERIPE"            .043371714651584625  3 3 "Achanta"  
"Andhra Pradesh" "  ALAIR (SC)"     2004 " DR. KUDUDULA NAGESH"        .5575404167175293 12 1 "Alair"    
"Andhra Pradesh" "  ALAIR (SC)"     2004 " MOTHUKUPALLY NARSIMHULU"    .3478609621524811 12 2 "Alair"    
"Andhra Pradesh" "  ALAIR (SC)"     2004 " DR. ETIKALA PURUSHOTHAM"   .02296549640595913 12 3 "Alair"    
"Andhra Pradesh" "  ALLAVARAM (SC)" 2004 " GOLLAPALLI SURYARAO"        .5028068423271179 14 1 "Allavaram"
"Andhra Pradesh" "  ALLAVARAM (SC)" 2004 " PANDU SWARUPA RANI"         .4317871034145355 14 2 "Allavaram"
"Andhra Pradesh" "  ALLAVARAM (SC)" 2004 " ETHAKOTA THUKKESWARA RAO"  .04949498176574707 14 3 "Allavaram"
end
label values ac ac
label def ac 2 "  ACHAMPET (SC)", modify
label def ac 3 "  ACHANTA (SC)", modify
label def ac 12 "  ALAIR (SC)", modify
label def ac 14 "  ALLAVARAM (SC)", modify
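A hedged sketch with egen's total() and cond(): per ac, total votes_margin over order_m 2 and 3, then compare it with the order_m == 1 value (variable names sum23, m1, and sum23_greater are placeholders):

```stata
bysort ac: egen sum23  = total(cond(inlist(order_m, 2, 3), votes_margin, 0))
bysort ac: egen m1     = total(cond(order_m == 1, votes_margin, 0))
gen byte sum23_greater = (sum23 > m1)
```

The cond() inside total() zeroes out the rows that should not contribute, so each sum is constant within ac and the comparison is a simple row-level inequality.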

Hardware for Stata double-digit-core version

I think I will upgrade my Stata to a 32-core version or more.

But I am wondering what hardware I have to buy. Currently I have a 10-core processor desktop.

My understanding is that I have two options:
(1) Buy a graphics processing unit (GPU).
(2) Buy a new motherboard that allows multiple processors, and buy 3 or 4 10-core processors.

Is there any other option other than these two?

I heard Matlab GPU computing requires writing m-file code somewhat differently. Is it the same for Stata? If I have to write a very different do-file just because I am using a GPU, that would be cumbersome.

Even if (1) and (2) achieve the same number of cores, will (2) be faster than (1)? What about the collapse, probit, and bysort commands?

Thank you!

Problem with negative values in log-transformation

Negative values and zeros become missing when log-transformed. How can negative values and zeros be transformed without losing observations? Is it wise to make them all positive by adding the same positive number to every observation before the log transformation? I learned from answers to my last question, about log-transforming a ratio variable, that adding a value to the original values is not a good idea. However, log-transforming negative values raises a different issue (missings).

Specifically, I want to log-transform x below in order to address the potential problem of outliers. In this case, in my field, log(x + 6) [6 being the magnitude of the smallest negative number] is a typical choice. Do you agree with this? Or do you have any other suggestion? I provide detailed information on variable x as follows.
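One hedged alternative sometimes used for variables with zeros and negatives is the inverse hyperbolic sine, which behaves like log() for large |x| but is defined everywhere (a sketch, not a recommendation for this specific variable):

```stata
* asinh(x) = ln(x + sqrt(x^2 + 1)); defined for all x, including 0 and negatives
gen x_ihs = asinh(x)
```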

Code:
sum x, det
                              x
-------------------------------------------------------------
      Percentiles      Smallest
 1%           -5            -6
 5%           -1            -6
10%           -1            -6       Obs                 712
25%            0            -6       Sum of Wgt.         712

50%            0                     Mean           .2373596
                        Largest      Std. Dev.       1.21111
75%            1             4
90%            1             4       Variance       1.466788
95%            2             5       Skewness      -.8423147
99%            4             5       Kurtosis       10.86292

Code:
graph box x


Code:
dataex x
----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float x
 1
 0
 1
 0
 0
 0
 0
-1
-1
 0
 0
 0
-1
-1
-3
 0
 0
 0
 0
 1
 1
 0
 1
 1
 0
-2
-2
-2
 0
 0
 0
 0
 0
 1
 1
 1
 1
 0
 0
-1
 0
 0
 0
 0
 0
 0
 0
 1
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 1
 1
 0
 0
 0
 0
 0
 0
 1
 1
 2
 2
 0
 1
 2
 0
 0
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
-3
-3
-1
end
------------------ copy up to and including the previous line ------------------

Listed 100 out of 712 observations

Interpretation of K-fold cross-validation's results

Dear all,

Hope you all are doing well.

I have graphical results from K-fold cross-validation checks and I am wondering how to interpret those results precisely. The K-fold cross-validation checks were used to compare the performance of count-data models, including Negative Binomial Regression 1 (NB1), Hurdle-NB1, Hurdle-NB2, and zero-inflated NB2. The following is the authors' interpretation: "Figure 8.7 shows a comparison of the NB models for the number of office-based visits. Although the NB2 model is the worse performer for each replicate, its hurdle counterpart performs quite well - either the best or close to the best-performing model" (Deb, Norton, and Manning).


Note: NB1 is the first bar, Hurdle-NB2 is the second, Hurdle-NB1 is the third, and Zi-NB2 is the fourth. The figure is taken from Health Econometrics Using Stata Book written by Deb, Norton, and Manning.

Results of log likelihood, AIC, and BIC are presented below.
Code:
                        K      LogLik         AIC         BIC
     Poisson          37  -10682.042   21438.085    21729.36
         NB2          38  -9995.2177   20066.435   20365.583
         NB1          38  -10020.414   20116.828   20415.976
Hurdle_Poi~n          74  -10118.265   20384.531   20967.082
  Hurdle_NB2          75  -9947.3831   20044.766   20635.189
  Hurdle_NB1          75  -10057.127   20264.254   20854.677
         ZIP          74  -10113.589   20375.177   20957.728
       ZINB2          75  -9937.9712   20025.942   20616.365
Could anyone help me understand the figure more clearly?

Thank you and have a nice week!

DL

How to write shorter

Hi reader,

I would like to know how I could make these formulas a bit shorter; I hope you can help.

Formula 1:

retX and retcX are in columns, firms in rows

gen ar2=ret2-retc2
gen ar3=ret3-retc3
gen ar4=ret4-retc4
gen ar5=ret5-retc5
gen ar6=ret6-retc6
gen ar7=ret7-retc7
gen ar8=ret8-retc8
gen ar9=ret9-retc9
gen ar10=ret10-retc10
gen ar11=ret11-retc11
gen ar12=ret12-retc12
gen ar13=ret13-retc13
gen ar14=ret14-retc14
gen ar15=ret15-retc15
gen ar16=ret16-retc16
gen ar17=ret17-retc17
gen ar18=ret18-retc18
gen ar19=ret19-retc19
gen ar20=ret20-retc20
gen ar21=ret21-retc21
gen ar22=ret22-retc22
gen ar23=ret23-retc23
gen ar24=ret24-retc24
gen ar25=ret25-retc25
gen ar26=ret26-retc26
gen ar27=ret27-retc27
gen ar28=ret28-retc28
gen ar29=ret29-retc29
gen ar30=ret30-retc30
gen ar31=ret31-retc31
gen ar32=ret32-retc32
gen ar33=ret33-retc33
gen ar34=ret34-retc34
gen ar35=ret35-retc35
gen ar36=ret36-retc36
gen ar37=ret37-retc37
gen ar38=ret38-retc38
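The 37 gen statements above can be collapsed into a forvalues loop (a sketch using the same variable names):

```stata
forvalues i = 2/38 {
    gen ar`i' = ret`i' - retc`i'
}
```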

Formula 2:

gen ret3year= ((1+ret2)*(1+ret3)*(1+ret4)*(1+ret5)*(1+ret6)*(1+ret7)*(1+ret8)*(1+ret9)*(1+ret10)*(1+ret11)*(1+ret12)*(1+ret13)*(1+ret14)*(1+ret15)*(1+ret16)*(1+ret17)*(1+ret18)*(1+ret19)*(1+ret20)*(1+ret21)*(1+ret22)*(1+ret23)*(1+ret24)*(1+ret25)*(1+ret26)*(1+ret27)*(1+ret28)*(1+ret29)*(1+ret30)*(1+ret31)*(1+ret32)*(1+ret33)*(1+ret34)*(1+ret35)*(1+ret36)*(1+ret37))-1

gen retc3year= ((1+retc2)*(1+retc3)*(1+retc4)*(1+retc5)*(1+retc6)*(1+retc7)*(1+retc8)*(1+retc9)*(1+retc10)*(1+retc11)*(1+retc12)*(1+retc13)*(1+retc14)*(1+retc15)*(1+retc16)*(1+retc17)*(1+retc18)*(1+retc19)*(1+retc20)*(1+retc21)*(1+retc22)*(1+retc23)*(1+retc24)*(1+retc25)*(1+retc26)*(1+retc27)*(1+retc28)*(1+retc29)*(1+retc30)*(1+retc31)*(1+retc32)*(1+retc33)*(1+retc34)*(1+retc35)*(1+retc36)*(1+retc37))-1
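Similarly, the long products could be built up in a loop (a sketch; same logic as the one-line formulas, accumulating the product term by term):

```stata
gen ret3year  = 1
gen retc3year = 1
forvalues i = 2/37 {
    replace ret3year  = ret3year  * (1 + ret`i')
    replace retc3year = retc3year * (1 + retc`i')
}
replace ret3year  = ret3year  - 1
replace retc3year = retc3year - 1
```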

Formula 3:

I need to count days per firm; since there are over 1,000 different firms, what would be a smart command to generate the counted days?
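For counting days per firm, a hedged sketch (firmid and date are placeholders for the actual firm identifier and date variables):

```stata
* observations per firm (equals days per firm if there is one row per firm-day)
bysort firmid: gen ndays = _N

* or, to count distinct days only
bysort firmid date: gen byte firstofday = (_n == 1)
bysort firmid: egen ndaysdistinct = total(firstofday)
```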

Appreciate the help

Extracting coefficient from matrix after bayes command

Can anyone please suggest how I can extract the coefficient of my explanatory variable after using the bayesmh command? Example below.

Code:
sysuse auto, clear
bayesmh foreign trunk mpg, likelihood(logit) prior({foreign:}, normal(0,1000))

mat list e(mean)
result of mat list
Code:
e(mean)[1,3]
             trunk        mpg      _cons
Mean    -.13484497  .11758344  -1.7574044
I want to extract the means of trunk and mpg only into a postfile, but I am not sure how to refer to the matrix element for each of the explanatory variables.

My try shows an error message:
Code:
postfile examp trunk using "bay.dta", replace
bayesmh foreign trunk mpg, likelihood(logit) prior({foreign:}, normal(0,1000))
post examp (e(mean))
postclose examp
type mismatch
post: above message corresponds to expression 1, variable means
r(109);
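A hedged sketch of a fix: post accepts scalar expressions, not whole matrices, so declare one postfile variable per coefficient and pull individual elements out of e(mean) with the matrix functions el() and colnumb():

```stata
sysuse auto, clear
postfile examp trunk mpg using "bay.dta", replace
bayesmh foreign trunk mpg, likelihood(logit) prior({foreign:}, normal(0,1000))
matrix M = e(mean)
* el(M, 1, j) returns the scalar in row 1, column j;
* colnumb(M, "name") looks up the column by variable name
post examp (el(M, 1, colnumb(M, "trunk"))) (el(M, 1, colnumb(M, "mpg")))
postclose examp
```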

any suggestion is appreciated

flag certain observation

Dear reader,
I would like to flag every 24th observation, starting from the fifth observation onwards. The following code is not working. How should I alter it?

gen flag=0
forvalue 1/$N{
replace flag=1 if (obs==_n+4+_n*28)
}
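One hedged sketch that avoids the loop entirely: with _n as the observation number, the mod() function flags observations 5, 29, 53, and so on:

```stata
gen byte flag = (_n >= 5) & (mod(_n - 5, 24) == 0)
```

At _n = 5, mod(0, 24) is 0, so the flag is 1; it is next 1 at _n = 29, i.e. every 24th observation thereafter.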

Kind regards

Heteroskedasticity for Random Effects

How can I test for heteroskedasticity in a random-effects panel?
I know xttest3, but it does not work with RE.