Channel: Statalist

To append or to merge

I would like to create a long panel from the 17 waves of panel data from the HILDA survey but am struggling with two key issues:
(1) code to append (or merge) the 17 waves, and
(2) dealing with the variable names, which include a prefix letter representing the wave number, e.g. ahgage = age in wave 1, bhgage = age in wave 2, chgage = age in wave 3; hence prefix 'a' = 1, 'b' = 2, etc. In this case, I believe I can use rename groups to remove the wave reference, but I am not sure of the exact code; and secondly, would I do so while appending the waves?

Most of the variables are in all waves (such as age, gender, education, wages); however, three (categorical) variables appear only in waves 4, 7, 10, and 14 (hence the separate append).

My code for appending the waves follows:

clear
set memory 1g
set more off
use "C:\data\Combined_a170c.dta" // original data file
keep xwaveid ahgsex ahgage aeduc awage

tempfile master
save "`master'", replace // save to temp data file

// add in data from other waves (in and not in wave 1)

* Cleaning data for waves 2 to 17 // (excl waves 4, 7, 10, 14 as these include specific data not in other waves)

local wave b c e f h i k l m o p q
foreach x of local wave {
use "C:\data\Combined_`x'170c.dta", clear
keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
save "`master'", replace

use "C:\data\Hilda\Combined_`x'170c.dta", clear
append using "`master'"
save "C:\data\basedata.dta", replace // new data file
}

* Cleaning data for waves 4, 7, 10, 14

local wave2 d g j n // (waves 4, 7, 10, 14)
foreach y of local wave2 {
use "C:\data\Combined_`y'170c.dta", clear
keep xwaveid `y'hgsex `y'hgage `y'educ `y'wage `y'reltype `y'relimp `y'relat //
save "`master'", replace

use "C:\data\Combined_`y'170c.dta", clear
append using "`master'"
save "C:\data\basedata.dta", replace
}

Stata responded with "invalid file specification"
r(198);
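For what it's worth, here is one hedged sketch of a single loop that strips the wave prefix with rename before appending. Paths and variable lists are taken from the post; the extra variables in waves 4, 7, 10, and 14 are handled with capture. Treat this as untested scaffolding rather than working code.

```stata
clear
local letters a b c d e f g h i j k l m n o p q
tempfile building
local first = 1
foreach x of local letters {
    use "C:\data\Combined_`x'170c.dta", clear
    * try the full list first; waves without the three extra variables fall back
    capture keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage ///
        `x'reltype `x'relimp `x'relat
    if _rc keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
    rename `x'* *                                       // drop the wave prefix
    gen int wave = strpos("abcdefghijklmnopq", "`x'")   // a=1, b=2, ...
    if `first' {
        save "`building'"
        local first = 0
    }
    else {
        append using "`building'"
        save "`building'", replace
    }
}
save "C:\data\basedata.dta", replace
```

Note that rename `x'* * strips the wave letter from every variable beginning with it; xwaveid is unaffected because no wave letter here is x.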


Your kind assistance is appreciated as always.

Stata job opportunity in Cambridge, MA


NBER has an opening for a "professional to work with researchers to explore, procure, & curate data
from various sources, make them “research ready” and help researchers use the data" which requires
Stata proficiency along with a general purpose language. See https://www.nber.org/jobs/employment_opp.html

Read list of numbers from txt to use in a macro

I have a txt file containing some numbers, say 1 5 6 17 18. I would like to import these into Stata and use the numbers as the steps in a for loop. Manually, I can copy and paste the values:

Code:
use data, clear

foreach i in 1 5 6 17 18 {
   // do something with each i
}

Essentially I would like to do the same thing programmatically. Here is how I believe the ingredients should look in Stata, but I have failed to finalize it properly:

Code:
import delim using "text.txt", delim(" ")
local steps <transform data to a local macro>    // pseudo-code

use data, clear

foreach i in `steps' {
   // do something with each i
}
The pseudo-code line is what I struggle with. Any ideas?
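One hedged sketch of the missing step: rather than importing the file as a dataset, read the line directly into a local macro with file read. This assumes the numbers sit on the first line of text.txt, separated by spaces.

```stata
* read the first line of text.txt into the local macro `steps'
tempname fh
file open `fh' using "text.txt", read text
file read `fh' line
file close `fh'
local steps `line'

use data, clear
foreach i of local steps {
    display `i'     // do something with each value
}
```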

Covariance structure in generalized mixed models

Dear all,

Using longitudinal data at the prefecture level, I conducted multilevel analyses with a random intercept (prefecture) using the menbreg command. I would like to specify an autoregressive covariance structure, but menbreg has no option for that, whereas mixed, the linear mixed-model command, does. The covariance() option for menbreg seems to have no meaning if the model contains only one random intercept. Could you please tell me how I can incorporate a correlation structure into a generalized mixed model?

Sincerely,
Mariko

Marginal effect for censored ordered probit

I am developing a censored ordered probit model to explain the level of education achieved, and I need to derive the associated marginal effects. Please help me if someone has already done this in Stata. Thank you in advance.

Convex hulls on scatter plots

A search for discussions of convex hulls in Stata forums or outlets reveals various programs from 1995, 1997, and 1998 in the Stata Technical Bulletin (all using Stata's old graphics), an ado-file cvxhull posted by Allan Reese in 2004,

https://www.stata.com/statalist/arch.../msg00193.html

and not much else. This is a surprise to me.

In what follows, my focus is entirely on what you can do on or with scatter plots -- or points in two dimensions -- and not with one dimension or with three dimensions or more.

For whatever reasons, convex hulls no longer seem popular or even known about in statistical graphics.

I should back up, as some people will already be lost if they do not know about convex hulls, or at least do not know the term. The idea is likely to be familiar or at least immediate once exemplified and it may summon distant memories of childhood pastimes in which you connected the dots and Cinderella, or a horse, or something equally interesting emerged from a puzzle book.

Here is a convex hull as produced by

Code:
ssc install cvxhull
sysuse auto, clear 
set scheme s1color 
cvxhull mpg weight, hull(1) noreport

[Graph: convex hull of mpg versus weight from cvxhull]

So, a convex hull is the smallest convex polygon including all the points in a set. Some points are on the hull and the others are inside.

A standard thought experiment is to imagine the points on the scatter plot as pins on a board. Summon up a giant rubber band (https://en.wikipedia.org/wiki/Rubber_band), stretch it to include all the points, and then let it go. The hull is now marked by the band.

OK, but why should you find this interesting or useful? It's when there are two or more groups that this becomes of note. I will show some more results before giving the small sales pitch, although if you need the pitch after the pictures, then I have probably failed.

cvxhull does not (and does not promise to) give you everything you may find helpful, but it does leave behind variables that are essential for further processing. Each hull is represented by two variables defining different sides of the hull.

Thus we can do things like this:



[Graphs: first and second convex hulls by group, foreign versus domestic]


To spell it out:

0. This is pretty easy to explain. In my experience, the story of pins on a board nails it easily for people new to the idea. Thanks to cvxhull it is easy to implement.

1. Convex hulls look good shown as areas contained. This enhances perception of point patterns as wholes.

2. Transparency as introduced in Stata 15 is invaluable whenever, as will be common in interesting cases, hulls overlap.

3. If the reaction is that the hull is unduly influenced by outliers -- indeed being on the hull is one way to identify outliers -- then we can carry out peeling. Onion-like, inside each convex hull lies another that is the convex hull of the remaining points (until we run out of data points). The second graph shows the second hulls.

4. In the code below, getting the sort order right is a crucial detail.

5. Old news to some, but orange and blue work well together.

Here is the complete code for the last two graphs:


Code:
sysuse auto, clear
ssc install cvxhull
set scheme s1color 
cvxhull mpg weight , group(foreign) noreport hull(2)
sort weight mpg 
local opts legend(off) aspect(1) yla(, ang(h)) ytitle("`: var label mpg'")

twoway rarea _cvxh1l _cvxh1r weight if foreign, color(orange%20) sort /// 
|| rarea _cvxh1l _cvxh1r weight if !foreign, color(blue%20) sort      ///
|| scatter mpg weight if foreign, ms(Oh) mc(orange)                   ///
|| scatter mpg weight if !foreign, ms(+) mc(blue) `opts' name(G1, replace)

twoway rarea _cvxh2l _cvxh2r weight if foreign, color(orange%20) sort ///
|| rarea _cvxh2l _cvxh2r weight if !foreign, color(blue%20) sort      ///
|| scatter mpg weight if foreign, ms(Oh) mc(orange)                   ///
|| scatter mpg weight if !foreign, ms(+) mc(blue) `opts' name(G2, replace)

Detail: The contact address on the help file for cvxhull is out-of-date. Allan has moved twice since then.

Rename large number of variables

Hi All,

I just downloaded survey data from DHS. The survey has variables that are very clearly labelled, but the variable names are still very obscure. So I used the -rename- command for multiple variables, i.e.

Code:
ren (v000-v002) (countcode_phase clust_no hhno)
It achieves the desired result, but I was hoping for code that is more user-friendly or considered best practice in Stata coding. This would be especially helpful when I start cleaning the older survey rounds, where variable and label names change.

Thank you so much.

Lori

generating restricted cubic spline curve using multiple imputed data

Is there a way to generate restricted cubic spline curve in multiple imputed data?

I used multiple imputation by chained equations to generate 20 complete datasets.
The continuous independent variable "spline1" was treated as an imputed variable.

Then I ran the code below after the imputation process.

Code:
mi xeq: mkspline spline1_mksp= spline1, cubic nknots(5) displayknots
mi estimate: stcox i.sex i.ocupation i.ethnicbackgrd i.bmi i.smoking c.spline1_mksp*, strata(group)
mi estimate, hr
mi xeq: levelsof spline1 if inrange(spline1, 0,15)
mi xeq: xblc spline1_mksp1 - spline1_mksp4, covname(spline1) at(`r(levels)') eform reference(0) generate(syst hr lb ub)


But the xblc command does not seem to work. The error it states is:

"requested action not valid after most recent estimation command. r(321);"


I am using Stata 15.1. Kindly let me know if there is any literature on the same.

Using conditional imputation with mi impute intreg

Can someone please explain to me how I can perform a conditional imputation with an interval regression? I am able to run a conditional imputation when specifying a normal regression function and replacing system missing values with a negative value beforehand. But when I ask for an interval regression (where I can't specify system missing values), I keep getting the error message "conditional(): conditioning variables not nested; [...]"

Code:
replace disas_age_ = -9 if disas == 0

mi impute chained (ologit) volunteer_frequency perceived_income general_health education  ///  
   (regress, conditional(if disas==1)) disas_age_ ///
   (logit) female ///
   (logit) disas , ///  
   rseed(658738) add(2) burnin(10) chaindots
Code:
gen disas_age_l = cond(disas_age_==.,0,disas_age_)
gen disas_age_u = cond(disas_age_==.,90,disas_age_)

mi impute chained (ologit) volunteer_frequency perceived_income general_health education  ///
   (intreg, ll(disas_age_l) ul(disas_age_u) conditional(if disas==1)) disas_age_imputed ///
   (logit) female ///
   (logit) disas , ///
   rseed(658738) add(2) burnin(10) chaindots
The full error message reads:

conditional(): no complete observations outside conditional sample;
imputation variable contains only missing values outside the conditional sample. This is not allowed. The
imputation variable must contain at least one nonmissing value outside the conditional sample.
-- above applies to specification (regress , conditional(if disas==1)) disas_age_

Stratified cluster randomization with 3 arms and co-variate balance

Hi,

Could anyone help me with conducting an unequal stratified cluster randomization to 3 arms (2 treatment and 1 control)?

We are running an RCT and have 1048 individuals spread across 42 blocks. We want to randomly assign 37% to treatment arm 1, 37% to treatment arm 2, and 26% to the control group, at the block level. Meanwhile, we want to ensure covariate balance on 15-20 variables.

I have two questions:
  1. What command can I best use to conduct this randomisation? I've tried the cvcrand command; however, it does not seem to allow for more than 2 treatment arms and unequal assignment.
  2. What is the best method to verify covariate balance after this form of random assignment? I've (successfully) used the randomizr command, but how can I test accurately for covariate balance? An ANOVA test (according to many recent articles) does not work in our case.
Thank you for the help!

Iris

How to combine values of two variables in same dataset into a single variable?

I have a variable code and another, PSUID. The variable code identifies the state and district of a respondent, while PSUID identifies the village. I want to create a VillageCode variable that combines the values of code and PSUID. For example, if code is 1213 and PSUID is 03, then VillageCode should be 121303.
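A hedged sketch of two possibilities, depending on how the variables are stored (this assumes PSUID always has two digits):

```stata
* if code and PSUID are numeric:
gen long VillageCode = code*100 + PSUID                     // 1213 and 3 -> 121303

* if you prefer a string result (keeps the leading zero of PSUID explicit):
gen VillageCode_s = string(code) + string(PSUID, "%02.0f")  // "1213" + "03"
```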
Please advise.
Regards

Stata blocks during simulation

Dear Statalisters

I am using Stata to run simulations to calculate the statistical power of a clinical trial, where I vary follow-up time and sample size to generate a matrix. Every time I run this simulation, after around 1000 loops Stata stops working, although the spinning wheel keeps going and Stata responds when I click Break. I have tried running the script below using both Stata 15 IC and Stata 16 MP, on a MacBook Pro (2.7 GHz, 16 GB) and on a Windows 10 machine with a 3 GHz Intel i5 and 9 GB.

This is the program I use to simulate the dataset. It is a balanced 2-arm trial without attrition, with a minimum of 2 visits (months 0 and 3) and up to 8 visits (24 months). The outcome I am measuring is well described by a quadratic function, and the effects I am using are derived from previous observational studies. I want to analyse the synthetic datasets with a mixed-effects model with a random intercept for each participant and random slopes for the effects of visit number and (visit number)^2.


Code:
            clear
            set seed 12345 // to make sure results are reproducible
            capture program drop power
            program define power, rclass
                args arm_size fu delta sigma_u_0i sigma_u_1i sigma_u_2i sigma_e_ij 
                assert `arm_size' > 0  & `fu' > 0 & `delta' > 0 & `sigma_u_0i' > 0 & `sigma_u_1i' > 0 & `sigma_u_2i' > 0 & `sigma_e_ij' > 0
                // CREATE A DATA SET FOR STUDY POPULATION
                drop _all
                set obs `=2*`arm_size''
                gen int obs_no = _n
                g covariate = rnormal(42,1)
                g arm = runiform()<0.5
                // PARTICIPANT LEVEL RANDOM INTERCEPTS
                g u_0i = rnormal(0,`sigma_u_0i')
                // AGE & AGE2 RANDOM SLOPES
                g u_1i = rnormal(0,`sigma_u_1i')
                g u_2i = rnormal(0,`sigma_u_2i')
                // CREATE FOLLOW UP VISITS OBSERVATION FOR EACH PARTICIPANT
                expand (`fu' + 1)
                bysort obs_no: g visit = (_n-1) 
                gen visit2 = visit^2
                // SAMPLE RESIDUAL ERRORS
                by obs_no, sort: gen e_ij = rnormal(0,`sigma_e_ij')
                // CALCULATE STUDY OUTCOME
                g y =     10 + 0.1*visit + (0.001)*(visit^2) + 0.1*covariate + (`delta'/12)*arm*visit + u_0i + visit*u_1i + visit2*u_2i + e_ij
                // ANALYSE SYNTHETIC DATA
                mixed y c.visit##c.arm c.visit2##c.arm c.covariate || obs_no: visit visit2, reml
                test visit#arm visit2#arm
                // RETURN ESTIMATED P-VALUE AND SIGNIFICANCE DICHOTOMY
                return scalar sig = `r(p)' 
                return scalar sig_ = (`r(p)' < 0.05)
                exit
            end
The code I use to set up the program arguments is as below:

Code:
// TRIAL SPECIFICATIONS
        local reps 1000                                  // NUMBER OF SIMULATION REPETITIONS
        local fu 3(3)24                                   // FOLLOW UP IN MONTHS
        local arm_size 50(50)400                 // ARM SIZE
        local delta 0.90                                 // EFFECT SIZE (10% REDUCTION IN OUTCOME PER YEAR)
// MODEL SPECIFICATIONS
        local sigma_u_0i 0.01                           // error term for random intercept
        local sigma_u_1i 1.0e-10                     // error term for random slope for age
        local sigma_u_2i 5.0e-13                     // error term for random slope for age2
        local sigma_e_ij 0.1                             // SAMPLE RESIDUAL ERRORS/observation-level error term
And finally this is the code I am using to loop the program over follow up time and sample size, and save the data:

Code:
// SET UP POSTFILE TO COLLECT SIMULATION RESULTS
tempfile results
capture postutil clear
postfile handle int arm_size int fu float sig byte sig_ using `results'
// DO SIMULATIONS LOOPING OVER SIZE AND FOLLOW UP
foreach a of numlist `arm_size'{
    display as text "Arm size " as result "`a'"
    foreach d of numlist `fu'{
        display as text _col(4) "Follow-up " as result %3.2f `d'
        forvalue i = 1/`reps'{
            display as text _col(8) "Repetition " as result %3.0f `i'
            quietly power `a' `d' `delta' `sigma_u_0i' `sigma_u_1i' `sigma_u_2i' `sigma_e_ij'
            post handle (`a') (`d') (`r(sig)') (`r(sig_)')
        }
    }
}
postclose handle
use `results', clear
I then use some commands to compute the power per follow-up time per sample size and generate a matrix (power per sample size per follow-up time).

I have also tried using simulate instead of looping over the number of repetitions, and it hits the same bug. Does anyone have any ideas for sorting this out, such as making the script more efficient and less bug-prone, or doing it in a completely different way that would have a higher chance of working? The only thing I can think of that might work would be to save the dataset of each simulation independently and then merge all the datasets to generate the matrix, but I am afraid this may be an issue related to computational power and the same thing will happen.

Any help is very welcome!
Filipe

How to compare statistical significance between 2 odds ratios calculated by logistic command

Hi,
I am trying to calculate odds ratios for diabetes based on exercise pattern.

Dependent variable (diabetes): 0 :No , 1: yes
Independent variable (exercise): 0: none, 1: R; 2: RW ; 3: RWS ; 4: RWST (R: running, W: walking, S: swimming , T: trekking)
controlled for : age, race, gender, education, income, employment

Here is my code:
Code:
logistic diabetes i.exercise age race gender education income employment 
----------------------------------------------------------------------
         |              Linearized
diabetes | Odds Ratio   Std. Err.      t    P>|t|  [95% Conf. Interval]
---------+------------------------------------------------------------
exercise |
   never |   (base)
       R |    .20983    .2740315    2.74   0.006    .142138    .236655
      RW |    .36915    .254335    -1.00   0.316    .3363361   .421921
     RWS |    .53138    1.0113      2.32   0.020    .156299    .558474
    RWST |    .9190     .8752752    3.57   0.000    .620794    .982926
----------------------------------------------------------------------
What I want is to compare the odds ratio of RWS (OR 0.53) and RWST (OR 0.91) and see whether the difference between them is significant.
Is there any way I can do that?
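One possibility, sketched under the assumption that RWS and RWST are levels 3 and 4 of exercise: a Wald test of equality of the two coefficients with test, or the ratio of the two odds ratios with lincom.

```stata
logistic diabetes i.exercise age race gender education income employment
* H0: the log-odds coefficients for RWS and RWST are equal
test 3.exercise = 4.exercise
* or report RWST's odds relative to RWS, with a confidence interval
lincom 4.exercise - 3.exercise, or
```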




Thank you for your help.

create diagonal matrix from non-diagonal matrix

Hi all,

It seems that Stata 13 had a simple way to create a diagonal matrix from a non-diagonal matrix:
diag(Z), with Z a matrix, extracts the principal diagonal of Z to create a new matrix.
Apparently you can't do this in Stata 15. Is there a simple alternative?
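For what it's worth, diag() and vecdiag() are both still documented matrix functions in recent Statas; a minimal sketch:

```stata
sysuse auto, clear
matrix accum Z = mpg weight       // a non-diagonal (cross-product) matrix
matrix D = diag(vecdiag(Z))       // diagonal matrix from Z's principal diagonal
matrix list D
```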

Thank you,
Stan

Interpretation of VECM results.

Hi all,
I have estimated a panel error-correction model using -xtpmg- (Stata 15.1).
I'm not sure how to interpret the results.
The DV is the shadow economy as a % of GDP and the IV is the unemployment rate.
The data are annual and the model is:
Code:
xtpmg d.Shadoweconomy d.unemployment, lr(l.Shadoweconomy unemployment) ec(ec) pmg
The error-correction term is -0.06. Does it mean that it takes about 17 years to return to long-run equilibrium after a shock to the unemployment rate?
As for the long-run coefficient, does the interpretation imply that a 1-percentage-point change in unemployment increases the shadow economy by 1.11% in the long run?
Does the same hold for the short-run coefficient: a 1% change in unemployment increases the shadow economy by 0.09%?
Any response will be greatly appreciated
                              PMG
Long run
  ∆.Unemployment             1.11**
Short run
  Error correction term     -0.06***
  ∆.Unemployment             0.09***

Calculating growth rates with panel data

I'm using data that has observations for every country-year-industry.

I want to make a column of 5 year growth rates.

So, say I have 10 countries, 50 years, and 100 industries. I want every country-year-industry to have its own 5 year growth rate.

Some direction would be nice, as it's my first time with panel data that has three identifying columns.
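A hedged sketch, assuming variables named country, industry, year, and an outcome x: combine the two non-time identifiers into a single panel id, xtset, and use the lag operator.

```stata
* one panel = one country-industry pair
egen long panelid = group(country industry)
xtset panelid year
* 5-year growth rate of x (missing for the first five years of each panel)
gen growth5 = (x - L5.x) / L5.x
* or, as a log difference: gen growth5 = ln(x) - ln(L5.x)
```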

counting unique observations by group

Hi,

I have two similar questions. 1) What is an elegant way of counting the number of unique values of a variable by group? For example, how can I find out how many villages are represented in the dataset for each country?

And 2) how can I use the levelsof command by group? Or, if that is not possible, is there a simple way to display the unique values of a variable by group (without using tab)?
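A hedged sketch of both, assuming variables country and village (and, for the loop, that country is numeric; string values would need quotes around `c'):

```stata
* 1) tag one observation per country-village pair, then sum the tags by country
egen tagged = tag(country village)
egen n_villages = total(tagged), by(country)

* 2) levelsof by group, via a loop over the levels of country
levelsof country, local(countries)
foreach c of local countries {
    display as text "country `c':"
    levelsof village if country == `c'
}
```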

Thank you!

Problem with no observations in the Levin-Lin-Chu test


I'm trying to run the Levin-Lin-Chu test, but Stata returns the message "no observations", r(2000). How can I fix this problem?
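For reference, the test is run with xtunitroot llc, which per its documentation requires the data to be xtset and the panel to be strongly balanced; missing values or an unbalanced panel are common causes of r(2000). A sketch with hypothetical variable names:

```stata
* panelvar, timevar, and y are placeholders for the poster's variables
xtset panelvar timevar
xtunitroot llc y, lags(1)
```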


simulating survival data NOT using survsim

Hi there,
I am trying to simulate survival data by rearranging the Weibull parametric survival function and can't seem to do it. If I use the survsim command it simulates the data perfectly but I want to understand how. Here is what I'm doing:
survsim - hazard rate=0.1, shape parameter=1.2, and HR=2 or log(HR) =0.693
set obs 500
gen trt = rbinomial(1,0.5)
survsim t, dist(weibull) lambda(0.1) gamma(1.2) covariates(trt 0.693)

This generates 500 survival times based on the specified parameters and Weibull distribution. If I continue to make an event variable based on these survival times and then run a Cox or Weibull regression, I get an HR for the trt variable of 2, which is what I'm looking for.

Rearranging the Weibull equation to solve for t using the parameters above:

Code:
set obs 500
gen trt = rbinomial(1,0.5)
gen t=(-ln(uniform())/0.1*exp(-2.30+0.693*trt))^(1/1.2)

This is the equation I've got for generating t for the Weibull distribution. However, if I run this in Stata I don't get the distribution of survival times I'm looking for; if I generate an event variable based on survival time and run a Cox or Weibull regression, I don't get a hazard ratio of 2 for the trt variable. I am definitely doing something wrong with the equation. Any suggestions?
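For what it's worth, inverting the Weibull survival function S(t) = exp(-lambda * t^gamma * exp(b*trt)) gives t = (-ln U / (lambda * exp(b*trt)))^(1/gamma), so lambda belongs in the denominator and the covariate term enters there with a positive sign. The posted formula effectively drops lambda (exp(-2.30) cancels the division by 0.1) and flips the sign on b*trt. A sketch of a corrected line:

```stata
clear
set seed 12345
set obs 500
gen trt = rbinomial(1, 0.5)
* t = (-ln(U) / (lambda*exp(b*trt)))^(1/gamma), with lambda=0.1, b=0.693, gamma=1.2
gen t = (-ln(runiform()) / (0.1*exp(0.693*trt)))^(1/1.2)
```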
Thanks,
Ben


Fama-French portfolio construction issue

Dear Statalists,

I am trying to follow the Fama-French portfolio-construction rule: each June, I sort all stocks into five groups, then I merge the annual portfolio data back to the monthly stock-return data to calculate portfolio returns. Since the monthly stock-return data always run from January to December, I would like to match the monthly stock returns from each July through the next year's June with the corresponding annual portfolio data. Do you know how I can merge these two datasets? Many thanks for your help in advance!

For example, something like this:

ym_P     P   ym_ret    ret
1990m6   1   1990m7    0.1
1990m6   1   1990m8    0.2
1990m6   1   1990m9    0.5
1990m6   1   1990m10   0.7
1990m6   1   1990m11   0.8
.....
1990m6   1   1991m6    0.1
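A hedged sketch of one way, assuming the monthly file has a %tm-formatted variable ym_ret and the annual portfolio file (here called portfolios.dta) is keyed on a stock identifier stockid and ym_P, both hypothetical names: map each return month to its formation June, then merge.

```stata
* returns from July of year y through June of year y+1 belong to June of year y
gen int ym_P = ym(year(dofm(ym_ret)) - (month(dofm(ym_ret)) <= 6), 6)
format ym_P %tm
merge m:1 stockid ym_P using "portfolios.dta"
```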