clustering with sequence analysis/om: can't handle dendrogram reversals

July 7, 2016, 8:32 am

Dear StataList,

I am currently working on a sequence analysis (with 6 potential sequence states) and use the -sqom- command for optimal matching.

As a next step I would like to cluster my sqom results. For this, I am using the -sqclusterdat- procedure as described in “Sequence analysis with Stata” (Brzinsky-Fay/Kohler/Luniak, 2006, The Stata Journal 6(4), 435-460).

My log-file looks like this:

Code:

. sqom, full k(2)
Perform 398278 Comparisons with Needleman-Wunsch Algorithm
Running mata function
Distance matrix saved as SQdist
 
. matrix dir
SQdist[893,893] levels[1,6]
 
. sqclusterdat
. clustermat wardslinkage SQdist, name(WARD) add

A problem occurs every time I try to produce a dendrogram. I type:

Code:

. cluster tree WARD, cutnumber(20)

And the Stata output is:

Code:

currently can't handle dendrogram reversals

I found a similar question in the old “statalist”, where Uli Kohler answered:

“So I can only add that during verification of -sqclusterdat- I quite regularly encounter the "currently can't handle dendrogram reversals" from cluster tree (and I don't remember whether it arises only with other methods than Ward's). I usually solve the problem by simply changing the cutnumber to something very close.” (22 Mar 2012)

Unfortunately, changing the cutnumber does not work for me. The error message occurs regardless of the cutnumber I specify - I tested cutnumber(1) to cutnumber(50).

What happened? What is wrong and how can I fix it?

I am grateful for any help and feedback!

Thank you!

Best Regards,
Doerthe

[Stata version 14.1; Windows 10]

↧

AUC when using CROSSFOLD

July 7, 2016, 8:46 am

Hello,

I am using Stata 14.1.

I have a quick question about the program CROSSFOLD. I am doing a 10-fold cross-validation with my data. Currently, this command can measure model accuracy or fit using R2, mean absolute error, or root mean squared error (http://fmwww.bc.edu/repec/bocode/c/crossfold.html). However, I wanted to ask if there is a way to obtain the AUC instead of these measures.

Thank you

Leo

↧

dynamic poisson/negative binomial model with CROSS-SECTIONAL data

July 7, 2016, 10:58 am

Any suggestions would be greatly appreciated. We want to use a Poisson or negative binomial model for CROSS-sectional data. I have examined the Statalist post on dynamic Poisson models with panel data but am curious whether problems exist when using lagged DVs in cross-sectional Poisson models. Can someone recommend citations that discuss and address any problems? Also, can someone recommend any Stata ado-files that would allow us to estimate these models? Thanks in advance.
Ed

↧

Help with Reshape

July 7, 2016, 11:32 am

Hello,

I'm very new to reshaping with Stata and hope someone can help me with what I am sure are very basic problems.

My data looks like

PersonID QuestionID Answer Gender Race
001 1 1 1 1
001 2 0 1 1
001 3 1 1 1

I want to make Question ID wide, so the data looks like

PersonID Question1ID Question2ID Question3ID Gender Race
001 1 0 1 1 1

In other words, one row per person, containing their gender, race, and answer to each question. I've tried every variety of variable combinations with the replace command and can't get what I need. My best guess is that it should be

reshape wide QuestionID Answer, i(Person ID) j (????)

but I don't know what variable should or even could be j. I sense, though, that I'm probably doing or not doing something else that is fundamental to the issue.

Any and all help is greatly appreciated.

Thanks,

Bryan
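
A minimal sketch of one way to read this, assuming QuestionID is numeric and Gender and Race are constant within PersonID: the variable to spread out is Answer, and QuestionID plays the role of j(). The toy data below just retype the example from the post, and Answer1-Answer3 are the names reshape generates by default rather than the Question1ID-style names asked for.

```stata
clear
input str3 PersonID QuestionID Answer Gender Race
"001" 1 1 1 1
"001" 2 0 1 1
"001" 3 1 1 1
end

* Answer is spread across columns; QuestionID supplies the suffixes.
* Gender and Race are carried along because they do not vary within PersonID.
reshape wide Answer, i(PersonID) j(QuestionID)
list
```

If Gender or Race did vary within a person, reshape would stop with an error naming the variable that is not constant within i(), which is a useful check in itself.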

↧

Crossfold -- k-fold cross-validation

July 7, 2016, 12:42 pm

Hello,

I am a fairly elementary Stata user. Currently I am using Stata 14.1. I am trying to perform k-fold cross-validation using crossfold (http://fmwww.bc.edu/repec/bocode/c/crossfold.html).
However, I am having trouble understanding what the output is telling me -- even with the help file -- and how I reasonably choose a model. I am doing 10-fold cross-validation.

The crossfold gives the summary R2 (or another measure of model fit) for each attempt (in my case 10 attempts). I'm unsure what I do from there. If I take out or add a variable and then get another 10 attempts how do I compare the different models? Is there a way to get an average of the 10 attempts and then compare the two? Is this the best way to compare the models?

Thank you for any help you can provide!

Best
Leo

↧

Conditional logistic regression analysis - how to appropriately weight to account for sampling bias?

July 7, 2016, 1:14 pm

We have data from a prospective cohort study on ~450 individuals recruited from three sites. We would like to conduct conditional logistic regression analysis (conditioning on study site), and attempted to do so within the svyset structure because we also want to incorporate probability weights to correct for potential sampling bias. However, we were unable to designate the study site variable as the 'group' within the syntax (svy: clogit outcome exposure covariates, group(site)) because the program seems to require the 'group' variable to be nested within the PSU (which we had not designated). When we then designated site as the PSU, we received a second error message saying that the weights need to be equal for all observations within the PSU, which is not the case in our data.

Is there a way to get around these restrictions so we can designate the study site as the 'group'? Or is there another approach to conduct conditional logistic regression analysis that also incorporates our probability weights?

Thanks so much in advance for your insights!

Anisha
Columbia University

↧

Random Forest: Predicting on a separate data set?

July 7, 2016, 2:41 pm

Hello,

I am an elementary Stata user. I used chaidforest to train a random forest classifier. However, I have a problem using the "predict" function on a dataset other than my training dataset. Is there any way to predict on the "test" set?

Thank you
Best
Nazanin

↧

suppress omitted variables

July 7, 2016, 3:05 pm

Is there a way to suppress omitted variables in the regression output?
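
A minimal sketch, assuming the question is about the rows Stata prints for terms dropped because of collinearity: the noomitted display option hides those rows. The auto dataset and the weight2 variable are used only for illustration.

```stata
sysuse auto, clear

* weight2 is perfectly collinear with weight, so one of them is omitted.
generate weight2 = 2*weight
regress price mpg weight weight2

* Same model, with the omitted row suppressed in the coefficient table.
regress price mpg weight weight2, noomitted
```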

↧

How to drop data based on 2 variables

July 7, 2016, 4:13 pm

Hi!

I am hoping to get somebody's help. I am new to Stata and I'm using version 14.1.

I would like to drop some data entry points based on two conditions (that they are postera=="1" and that they fall before the date 1680920050101).

I am trying to use the drop command like this but I am getting an error message :

drop if date1<1680920050101 and postera=="1"
"invalid 'and' " r(198)

What am I doing wrong? I want to eliminate patients who entered my cohort before this date and are erroneously coded as postera=="1".

Thank you!
Maria
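
A minimal sketch, assuming date1 is numeric and postera is a string variable as written in the post: Stata's logical operators are & (and) and | (or), so the word "and" is what triggers r(198).

```stata
* Check how many observations meet both conditions before dropping them.
count if date1 < 1680920050101 & postera == "1"

* Drop observations that satisfy both conditions at once.
drop if date1 < 1680920050101 & postera == "1"
```

Observations with a missing date1 are not dropped by this condition, because Stata treats numeric missing values as larger than any number.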

↧

How to use Vuong test to compare Heckman and Double-hurdle models

July 8, 2016, 7:45 am

Hello everyone!
I am trying to use the Vuong test to compare a Heckman model with a double-hurdle model.
But in the first hurdle of the Heckman model, the results show that two variables have only coefficients, and all other estimates are missing (e.g., Std.Err., t, P>|t|, and Interval "-6.900408 . . . . ."). I checked that there is no multicollinearity, and I don't know why this happens, since there is no such problem in the double-hurdle model.
So, can I still compare these two models using the Vuong test if these two variables are deleted from the first hurdle of the Heckman model (so that the number of explanatory variables in the first hurdle differs between the two models)?
If so, how do I conduct the Vuong test?

Any help and suggestion would be greatly appreciated.

Thanks,

Hua

↧

filling empty data in variable from other values of variable

July 8, 2016, 8:17 am

I am using a panel dataset with the following columns:
name date close dprob1

I would like to make a new column weighted_equity_change using something like the following:
gen wghtd_eqty_chng = (close - l1.close)/dprob1 if date == td("ddmmyyyy")

This creates a new variable, but only for that specific date. How can I set the missing values for each name equal to this new value? In other words, I want something like:
by name: replace wghtd_eqty_chng = wghtd_eqty_chng if date == td("ddmmyyyy")

Thanks in advance for help.
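
A minimal sketch of one way to spread the single computed value across all rows of each name, assuming name is a string identifier, date is a daily Stata date, and there is one row per name and date. The date 07jul2016 and the panel_id variable are placeholders introduced for illustration only.

```stata
* The lag operator needs the panel declared; panel_id is a hypothetical helper.
egen panel_id = group(name)
xtset panel_id date

* Value is computed only on the chosen date (placeholder date shown here).
gen wghtd_eqty_chng = (close - L1.close)/dprob1 if date == td(07jul2016)

* Within each name, nonmissing values sort first, so [1] is the computed value.
bysort name (wghtd_eqty_chng): replace wghtd_eqty_chng = wghtd_eqty_chng[1]
```

With daily dates, L1.close refers to the previous calendar day, so the value stays missing whenever that day has no observation (for example, after a weekend).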

↧

Comparing maps built with spmap

July 8, 2016, 8:35 am

Dear Statalist
I constructed two maps using spmap to display the kernel estimates of the probability distributions of event X and event Y. I used spgrid and spkde to estimate these distributions.
The two maps are built on the same geographical area A. Please see the png image below.
Now I need to statistically compare these two maps in order to determine the degree of overlap between the two spatial patterns. In other words, I need a metric to determine whether the two spatial distributions of the data are correlated or not.
What I did was to compute a simple pairwise correlation between the two probability distributions created by spkde across all s_g cells that form the grid covering A.
However, I’m not sure that this is the correct way to proceed.

Is there a routine in Stata that could help me with that?
Thanks a lot in advance for any suggestion!!
Best
Maria

↧

Trouble importing XML

July 8, 2016, 8:57 am

Hi All,
Apologies for a new post on this topic. However, I searched through the forum and didn't find an existing resolution to my query.

I'm trying to import XML files from the USPTO website into Stata (ver. 14.1) but haven't been successful. I tried File>Import>XML Data from the point-and-click options as well as the command <xmluse> with option <doctype(dta)>. However, I get the message <unrecognizable XML doctype>. I also searched the net and this forum and found that I'm not the only one having trouble importing XML files into Stata, especially the ones provided by the USPTO.

Here's a link to the files I'm trying to import <https://bulkdata.uspto.gov/data2/patent/grant/redbook/bibliographic/2015/>. There's also a file with the extension <dtd> at the bottom, which I think has something to do with importing the XML files, but I'm not sure what it does or how to handle it. These are rather large files, which is why I'm also unable to convert them online through XML-to-CSV converters.

It would be a great help if someone could advise me on how to import such files into Stata correctly, or point me to any program that can convert .xml files into another format that can be imported into Stata easily.

Appreciate any help in this regard

Thanks
ash

↧

Problem with biprobit with cluster

July 8, 2016, 9:29 am

Dear Statalisters,
I am running a biprobit with cluster.

I ran the command multiple times and I obtained the same coefficients but two different standard errors.
Also the degrees of freedom of the Wald test (which is missing) vary between 29 and 30.

What could be the reason?
Many thanks in advance

↧

mi impute chained (nbreg) issue?

July 8, 2016, 11:07 am

I'm using Stata 14.1. I have panel data observed at baseline and at 6 follow-up assessments. The response variable is an overdispersed count variable with missing values. 15 of 226 observations were missing at all follow-up assessments. I'm trying to generate 20 complete data sets using chained equations. There is no missing data for any of the right-hand-side variables. Here is my code:

Code:

mi impute chained (nbreg) acount1 acount3 acount6 acount9 acount12 acount15 ///
    = cond age gender white hispanic black schfull schpart employed ///
    agealc nalabdis agemj nmjabdis blarate blbrate blmjrate ///
    , add(20) force

I then get a table summarizing the imputations:

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
           acount1 |        195           31        27 |       226
           acount3 |        178           48        42 |       226
           acount6 |        173           53        45 |       226
           acount9 |        162           64        52 |       226
          acount12 |        161           65        53 |       226
          acount15 |        160           66        55 |       226
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

Note: Right-hand-side variables (or weights) have missing values;
model parameters estimated using listwise deletion.

The command is not fully populating the imputed data sets. For acount1 the minimum imputed is 27, while 31 observations have incomplete data. This also occurs when I use the orderasis option or the nomonotone option, and when I specify mi impute chained (poisson). However, if I treat the data as continuous and specify mi impute (reg), I'm able to generate 20 fully populated imputed data sets, albeit with the wrong model specification for this outcome.

If I don't use the force option I get an r(498) error:

acount6: missing imputed values produced
This may occur when imputation variables are used as independent variables or when independent variables contain missing values. You can specify option force if you wish to proceed anyway.

Finally, if I use the noimputed option the procedure runs without error and fully populates all 20 imputed data sets. However, since these are panel data and most subjects are observed on multiple occasions, this would exclude the most valuable information regarding missing values.

Any thoughts or suggestions would be much appreciated.

Brad

↧

fixed effects by gvkey

July 8, 2016, 11:13 am

I'm using Stata 13 in a Windows 10 environment.

I'm trying to run a linear regression with firm and year fixed effects. For the firm FE, I'm entering gvkey as i.gvkey.

The thing is, I keep getting a "too many variables" error. I changed matsize to 11000 (the maximum allowed) and the error still shows up.

Anyone have some insights about what I'm doing wrong and how should I do it?
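
A minimal sketch of the usual workaround, assuming gvkey is numeric and the time variable is called year: let the estimator absorb the firm effects instead of creating one dummy per firm with i.gvkey. The names y, x1, and x2 are hypothetical placeholders for the outcome and regressors, and clustering by firm is an illustrative choice rather than something stated in the post.

```stata
* If gvkey is stored as a string, convert it first:
* destring gvkey, replace

* Within (fixed-effects) estimator: firm effects are swept out, years stay as dummies.
xtset gvkey
xtreg y x1 x2 i.year, fe vce(cluster gvkey)

* Same slope estimates, with the firm dummies absorbed rather than estimated.
areg y x1 x2 i.year, absorb(gvkey) vce(cluster gvkey)
```

Either command avoids the thousands of firm indicators that push the model past the matsize limit.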

↧

How do I count number of times a variable changes?

July 8, 2016, 11:53 am

I'm doing a study where I need to figure out how often a certain facility changes its type during the study period (ex. type=0 if small facility, type=1 if big facility) and the facility type changes from year to year. (ex. in 2004 and 2005 they would be type 0, 2006 type 1, 2007 type 0). So in this case, facility A would have a total of 2 type changes during the study period (went from type 0 to type 1 to type 0).

Could you help me figure this out? I've never used loops or macros before, so I am at a loss. Here's an example of the data; the variable I'm looking for is total_changes. Thank you

ID Facility Year Type total_changes
1 A 2004 1 3
2 A 2004 1 3
3 A 2004 1 3
4 A 2004 1 3
5 A 2005 0 3
6 A 2005 0 3
7 A 2005 0 3
8 A 2006 1 3
9 A 2006 1 3
10 A 2007 0 3
11 A 2007 0 3
12 A 2007 0 3

14 B 2004 1 0
15 B 2004 1 0
16 B 2004 1 0
17 B 2004 1 0
18 B 2006 1 0
19 B 2006 1 0
20 B 2006 1 0
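
A minimal sketch that needs no loops or macros, assuming Type is constant within each facility and year, and using the variable names from the example table. The helper variables changed and n_changes are names introduced here for illustration.

```stata
* Within each facility, in year order, flag rows whose Type differs
* from the row immediately before it.
bysort Facility (Year ID): gen byte changed = Type != Type[_n-1] if _n > 1

* Sum the flags within each facility; missing flags count as zero.
by Facility: egen n_changes = total(changed)
```

On the example data this gives 3 for facility A (1 to 0, 0 to 1, 1 to 0) and 0 for facility B, matching the total_changes column.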

↧

tab2xl - "if" option?

July 8, 2016, 12:34 pm

Hello, I'm currently trying to organize a large number of tabulations in an excel file. tab2xl seems to be the most easy-to-use command for doing so, especially because it contains row totals, which I need, and I don't know how to include them when using putexcel, which is considerably clunkier. However, many of the tabs contain if statements (sometimes multiple ones) - is there any way for tab2xl to support this? And if not, is there an easy way of getting totals using putexcel? I have included an example of what I mean, including the problematic if statement in the last line (that doesn't work).

Thanks!

Code:

disp "Race distribution in intertwilight period"
putexcel F`samplename'=("Race distribution in intertwilight period" )    
tab driverrace if intertwilight==1, matcell(freq) matrow(names)    
tab2xl driverrace if intertwilight==1 using Results, row(`top') col(6) sheet(2010)

↧

Interpretation of fmlogit variables

July 8, 2016, 3:45 pm

I am very new to Stata and am using Maarten Buis' -FMLOGIT- (SSC) to model proportions for four outcomes (proportions of orders filled through a specific shipping method). The solution uses one outcome as a base case. When interpreting coefficients of the independent variables, how is the base case treated? I am looking for effects of the independent variables on the proportions of the outcome. So how do I treat the base case?

↧

Testing for difference in marginal effect

July 8, 2016, 3:54 pm

Hi there,

I have looked everywhere for a solution to my problem, but nothing worked: so here it goes.

I estimated the marginal effect of age on price for a dataset of real estate properties. I estimated the effect of age using both a linear and a quadratic term, to get a non-linear marginal effect of age on price. My dataset contains properties with age ranging from 0 to 75 years.

This all worked fine, but now I want to test whether different types of real estate differ in the marginal effect. In order to test this I ran the following code:

reg lnpm $varsdef c.age#i.soort c.age#c.age#i.soort, robust
margins, dydx(age) over(soort age) post

Where $varsdef contains age, age squared and a bunch of control variables. And soort is a categorical variable with 4 groups.

So the result is that I get 4*75 marginal effects of age on price and I want to test whether the effect differs significantly between groups. Which looks like this:

margins, dydx(age) over(soort age) post

Average marginal effects                        Number of obs     =      1,969
Model VCE    : Robust

Expression   : Linear prediction, predict()
dy/dx w.r.t. : age
over         : soort age

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age          |
   soort#age |
    Office#0 |   -.025091   .0038006    -6.60   0.000    -.0325448   -.0176372
    Office#1 |  -.0244726   .0036964    -6.62   0.000    -.0317219   -.0172233

I think in order to show that the marginal effect differs between groups, it should be a combination of

testnl _b[1.soort#0.age] = _b[2.soort#0.age] _b[3.soort#0.age] = _b[4.soort#0.age]
testnl _b[1.soort#1.age] = _b[2.soort#1.age] _b[3.soort#1.age] = _b[4.soort#1.age]
..etc.

Is there a way to get this combination for each individual age without having to repeat this line of code 75 times? And does this show whether the marginal effect differs between groups?

Your help is very much appreciated!
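
A minimal sketch of how the repetition could be avoided with forvalues, run right after the margins, ... post call. The coefficient names are copied from the testnl lines in the post and are an assumption; the first line lists the names margins actually posted so the loop can be adjusted if they differ. Because the hypotheses are linear, the sketch uses test rather than testnl.

```stata
* Inspect the names of the posted marginal effects before testing.
matrix list e(b)

* For each age, jointly test that the four groups share one marginal effect.
forvalues a = 0/75 {
    display as text "age = `a'"
    test (_b[1.soort#`a'.age] = _b[2.soort#`a'.age]) ///
         (_b[1.soort#`a'.age] = _b[3.soort#`a'.age]) ///
         (_b[1.soort#`a'.age] = _b[4.soort#`a'.age])
}
```

The loop stops with an error at any age for which one of the groups has no observations, and running one test per age raises the usual multiple-testing concern.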

↧