Manipulating Y axis using "stripplot" - getting rid of extra space/controlling margins

December 5, 2016, 2:34 pm

Hi all,

I've been Googling an answer to this for FAR too long and am wondering if anyone can help.

I'm trying to graph a series of box plots using "stripplot" in Stata/IC 13.1. I am plotting the interquartile range of birth weight (in grams) by year, with three other line graphs overlaid:

smoothed lowess line for the full sample over time
smoothed lowess line for those in the 10th birth weight centile over time
smoothed lowess line for those in the 90th birth weight centile over time

This is my code:

Code:

 
stripplot birthweight, over(year) vertical ///
box(bfcolor(gs14) barw(0.2)) iqr(1.5) ms(none) ///
addplot(lowess birthweight year, lcolor(black)  yscale(range(1500 4500)) ylabel(1500(500)4500) || ///
lowess tenth1 year,  yscale(range(1500 4500)) ylabel(1500(500)4500) || ///
lowess ninety1 year,  yscale(range(1500 4500)) ylabel(1500(500)4500))

I've specified the yscale and ylabel above because I want to "zoom into" these values and not extend the Y axis to the full range of underlying birth weight data (which go from a min of ~450g to max of ~4700g).

After manually editing the size of the X/Y axis labels for clarity, my graph looks like this:
[graph image not available]

::Please ignore the fact that the colours are terrible!:: All I want to do is get rid of that huge empty space at the bottom of the Y Axis and just zoom into the relevant part of the graph. It seems as though the margins are still adjusted to the default min/max values of the underlying raw data (mainly at the bottom/lower values of the Y axis).

Can anyone please help me figure out how to manipulate these margins? I don't understand why the graph is produced this way.

Thank you!

Catherine

↧

Regression on Continuous Variables with ANOVA Normalization

December 5, 2016, 2:57 pm

Dear Statalisters-

I would like to run a one-way ANOVA-type regression of a continuous variable on a categorical variable and get the output with ANOVA normalization instead of dummy-variable normalization.

I generated some data in which the grand mean is 50 and the four group means are 44, 48, 52, and 56, respectively:

Code:

clear
set obs 1000
egen group = seq(), to(4) block(250)
set seed 112
gen y = 50 + (12*(group-1))/3-6 + rnormal(0,6)
reg y i.group

I get this output:

Code:


. reg y i.group

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(3, 996)       =    174.74
       Model |  19051.9031         3  6350.63438   Prob > F        =    0.0000
    Residual |  36198.6462       996  36.3440223   R-squared       =    0.3448
-------------+----------------------------------   Adj R-squared   =    0.3429
       Total |  55250.5494       999  55.3058552   Root MSE        =    6.0286

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |
          2  |   3.229101   .5392144     5.99   0.000     2.170974    4.287227
          3  |   7.916969   .5392144    14.68   0.000     6.858843    8.975096
          4  |   11.41936   .5392144    21.18   0.000     10.36123    12.47749
             |
       _cons |   44.50601   .3812822   116.73   0.000     43.75781    45.25422
------------------------------------------------------------------------------

This is the typical dummy-variable normalization, in which the coefficient for the first group is suppressed and its mean is "sent" to the constant. So the constant, beta_0, is the mean of Group 1; the Group 2 mean is (beta_0 + beta_1); the Group 3 mean is (beta_0 + beta_2); and so on.

What I would like is for the output to be expressed as ANOVA-type normalization. In dummy variable normalization, the identifying restriction is to suppress the coefficient for one of the categories. In ANOVA-type normalization, the identifying restriction is that the coefficients sum to 0: summation beta_i = 0. So, the output for the above regression would give the grand mean, 50, as the constant, and all four groups would have regression coefficients: -6, -2, 2, and 6 (or thereabouts), respectively.
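One possible way to get from the dummy-variable output to the sum-to-zero form, sketched here as an illustration rather than a confirmed answer: after the same regression, the `contrast` command with the `g.` operator reports each group's deviation from the balanced (unweighted) grand mean, and `lincom` can recover the grand mean itself as the average of the four group means. The data step simply repeats the simulation from the post.

```stata
* Recreate the simulated data from the post
clear
set obs 1000
egen group = seq(), to(4) block(250)
set seed 112
gen y = 50 + (12*(group-1))/3 - 6 + rnormal(0,6)

regress y i.group

* Sum-to-zero effects: each group mean minus the unweighted
* average of the four group means
contrast g.group

* Grand mean: Group 1 mean (_cons) plus the average of the four
* dummy coefficients (the Group 1 coefficient is 0)
lincom _cons + (2.group + 3.group + 4.group)/4
```

With equal group sizes, as here, the balanced grand mean coincides with the overall sample mean of y; with unequal group sizes the two differ, and the `gw.` operator would give deviations from the observation-weighted mean instead.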

Does anyone know how to do this?

I greatly appreciate your help.

Best,
David

↧

Subtracting Dates Code

December 5, 2016, 3:28 pm

Hello,

I need help generating a new variable that subtracts the 1st date from the 2nd date. My data look similar to the example below. The dates belong to the same ID, and I want to subtract consecutive episode dates from each other: 2nd date - 1st date, and 3rd date - 2nd date.

ID	Episode Date
4	3/21/2013
4	6/12/2014
4	8/31/2015

How would I generate a new variable that does this? Or how can I create a binary variable that marks whether the dates are at least 1 month apart (0 = less than 1 month apart, 1 = at least 1 month apart)?
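A minimal sketch of one way to do this, under some assumptions: the date is stored as a string in month/day/year order, the variable names `EpisodeDate`, `edate`, `gap_days`, and `apart` are made up for illustration, and "1 month" is taken to mean 30 days.

```stata
* Example data from the post (variable names are hypothetical)
clear
input ID str10 EpisodeDate
4 "3/21/2013"
4 "6/12/2014"
4 "8/31/2015"
end

* Convert the string to a Stata daily date
gen edate = date(EpisodeDate, "MDY")
format edate %td

* Days since the previous episode within the same ID;
* the first episode of each ID has no previous date, so it is missing
bysort ID (edate): gen gap_days = edate - edate[_n-1]

* 1 if at least 30 days after the previous episode, 0 otherwise,
* missing for the first episode of each ID
gen byte apart = gap_days >= 30 if !missing(gap_days)

list, sepby(ID)
```

Sorting by `edate` inside the `bysort` parentheses guarantees that `[_n-1]` refers to the chronologically previous episode of the same ID.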

Thanks

↧

Test proportional odds assumption in a long-format data

December 5, 2016, 5:08 pm

Hi,

I am going to run -ologit- on a dataset in long format whose outcome variable has 3 ordinal levels (1, 2, 3). Before proceeding with -ologit-, I need to use the -omodel- command to test the proportional odds assumption. In long format, the observations are not completely independent of each other, because some of them are correlated within the same subject, so I want to use cluster(id) to account for this. However, it seems that cluster() is not allowed with -omodel-. I typed "omodel logit e gonad_total_1, cluster(pt_study_id)", and it returned "options not allowed". Is there another way in Stata to test the proportional odds assumption while accounting for the intra-cluster correlation?

Thanks a lot! Your input is much appreciated!

Regards,
Mengmeng

↧

Variable "was int now float"

December 6, 2016, 5:39 am

Dear Stata users,
I would like your thoughts on something that seems strange to me. I'm appending several datasets to create one big dataset. After harmonizing the variables I'm interested in, I used "append using" for each dataset I wanted to append. Everything works, and in the end I obtain my big dataset.

What I did not really understand is why, when appending one of my datasets, Stata shows a message (not in red, so not an error that stops the process) stating "var A was int now float". What does it mean? It happens with only one of the datasets that I appended, even though I coded and treated the variables in exactly the same way in all of them. Is it something I should be worried about?

Thanks a lot, G.

↧

how to conduct bootstrapping for intraclass correlation (test-retest reliability)

December 6, 2016, 5:46 am

Hi,

I computed the intraclass correlation in Stata (. icc) to assess the test-retest reliability of my study participants (basically, they completed a similar set of questionnaires 1 week apart).

I then tried to use the prefix command bootstrap:
. bootstrap ICC = r(icc_i), reps(100) cluster(ID) : icc score ID visit

I ended up with this error message:

Bootstrap replications (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxx
insufficient observations to compute bootstrap standard errors
no results will be saved

Wondering whether anyone can help me with this?

Thank you

↧

Making all my command with fweight

December 6, 2016, 6:41 am

Hi all,

I am using [fweight=var] in all my commands. Is there a statement I can put at the beginning of a do-file so that all my calculations use [fweight=var], to avoid adding fweight to each command?

Regards
Rodrigo

↧

Generating a new variable as the mean of multiple variables

December 6, 2016, 6:56 am

Hi,

I want to create a new variable called mean wage, which is going to measure the average wage by job. I have 2 wage variables in my data: one for employees and one for the self-employed. I have been trying different combinations of the egen command, like the one below; however, they are obviously wrong, and I am not sure how to correct them or whether I should be using a different command after all.

Code:

egen meanwage= mean (employeewage)if employmentstatus==3 & mean (selfemployedwage)if employmnetstatus<3, by (job)

I know that for a single mean variable the code would be:

Code:

egen meanwage= mean (employeewage), by (job)

However, I am confused as to how to combine 2 mean variables into 1.
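One possible approach, sketched under the assumption that `employmentstatus == 3` marks employees and values below 3 mark the self-employed (as in the attempt above): first build a single wage variable that picks the relevant wage for each person, then average that variable by job. The toy data and the helper variable `wage` are invented for illustration.

```stata
* Toy data (values are made up)
clear
input job employmentstatus employeewage selfemployedwage
1 3 100 .
1 3 200 .
1 1 .   300
2 2 .   400
2 3 600 .
end

* Step 1: one wage variable, taken from whichever source applies
gen wage = employeewage if employmentstatus == 3
replace wage = selfemployedwage if employmentstatus < 3

* Step 2: average the combined wage within each job
egen meanwage = mean(wage), by(job)

list, sepby(job)
```

The `if` qualifier selects observations for a whole command, so it cannot be attached to individual terms inside an `egen` function call; combining the two wages before calling `egen` avoids that problem.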

↧

Create list / columns for regression coefficients of various (simultaneous) regressions

December 6, 2016, 8:12 am

Dear all,

I need some help with a problem that occurred in connection with my master's thesis and which, I hope, can be answered relatively easily by more advanced users.
Specifically, I am examining the influence of executive compensation on risk taking in mergers and acquisitions.
For this purpose, the risk variable is calculated as the difference between abnormal stock return volatility before and after the transaction.
The abnormal stock returns are calculated as the difference between observed and predicted stock returns.
To get the predicted stock returns, I have to estimate two regressions for each transaction: one for the period before and one for the period after the transaction.

The corresponding table looks as follows:

No Date ReturnCompany ReturnIndex

1
1
1
.
.
840

I have already regressed three test periods using the code:

forvalues i = 1/3 {
reg ReturnCompany ReturnIndex if No==`i'
}

What I need now is a list of the estimated regression coefficients per estimation period (or new columns in my table for both of them, respectively).

I've already tried listcoef (after installing spost13), but this only yields the coefficients of the last period.
If I use

forvalues i = 1/3 {
listcoef if No==`i'
}

(as above), Stata shows an error ("if not allowed")

The code
forvalues No = 1/3 {
listcoef
}

yields three times the result of the last period (in three tables).

Is there any possibility to (a) generate a single table including columns for (1) the number of the observed period, (2) the constant, (3) the beta (second coefficient) or (b) create a new row in my input table, with two columns for the coefficients, where these are equal for each number (row1) ?
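Both variants can be sketched roughly as follows, as an illustration rather than a confirmed answer. For (b), the loop stores `_b[_cons]` and `_b[ReturnIndex]` in new variables right after each regression; for (a), `statsby` collects the coefficients into a dataset with one row per value of No. The toy data and the names `alpha` and `beta` are assumptions made for the example.

```stata
* Toy data standing in for the real table (values are made up)
clear
set seed 1
set obs 90
gen No = ceil(_n/30)
gen ReturnIndex = rnormal()
gen ReturnCompany = 0.01 + No*ReturnIndex + rnormal(0, 0.1)

* (b) new columns holding the coefficients, constant within each No
gen alpha = .
gen beta  = .
forvalues i = 1/3 {
    quietly regress ReturnCompany ReturnIndex if No == `i'
    quietly replace alpha = _b[_cons]       if No == `i'
    quietly replace beta  = _b[ReturnIndex] if No == `i'
}

* (a) a single table with one row per period: No, slope, constant
preserve
statsby _b, by(No) clear: regress ReturnCompany ReturnIndex
list
restore
```

The `preserve`/`restore` pair is there because `statsby ..., clear` replaces the data in memory with the table of coefficients.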

I suppose this is an easy question for some of you, but I couldn't find an answer by searching on the Internet (for a really long time!).
So I would be more than grateful if somebody could answer.

Best regards
Marion

↧

Collinearity problems

December 6, 2016, 8:14 am

Dear Statalist,
I have cross-sectional survey data and I want to determine how much an individual trusts others given the democratic history in his country. I constructed a variable 'demcapital' that measures the democratic history in the country and ranges from 0 to 1. My code in Stata is:
reg trust demcapital male age i.education i.religion i.socialclass lngdppercapita warincountry socialistpast colony_uk colony_esp africa asia i.country i.year, vce(robust)
However, when I regress this, Stata omits 5 countries and 1 year "because of collinearity".
Since the number of observations is not small (more than 100,000), and most of the variables are statistically significant and not correlated, I have not been able to find the source of collinearity. Also, if I choose other dependent variables, where the number of observations shrinks to approximately 40,000, the number of omitted countries and years doubles.
Could anyone advise me whether this is an error in my model and whether it would be more advisable to use a multilevel model?

Thank you for your attention.

↧

Fixed effects for large non-panel data

December 6, 2016, 10:07 am

Hello,

This is my first Statalist post, so please forgive me if I'm not explaining the question clearly enough.

I have a large dataset with over 4 million observations. Major variables include shopping trip ids (made by different individuals), products, date of purchase, price paid, location, etc. The goal is to run a few regressions with multiple fixed effects (i.e., controlling for product, store and time together). I couldn't make it a panel dataset using

Code:


xtset product date

because of repeated time values. I could group trip ids and date of purchase to make a unique time id and then use xtset but I'm not sure if it will be the best way. Because trip ids are just random numbers, that makes time ids not in chronological order. Will that affect the use of panel data commands?

If not, would you suggest ways to set up the regression with a large non-panel dataset? Thanks a lot.

- Louise

↧

Repeatability versus ICC versus Concord

December 6, 2016, 11:25 am

Dear all,

I would like to hear whether there is any alternative to ICC or Concord (Lin) for repeatability analysis; by the way, it is sometimes difficult to choose between them.

In our validation study we administer the same questionnaire to the same person with the same interviewer. As I understand it, this is a repeatability analysis, but I'm having difficulty selecting the best command to evaluate this. Can someone help me?

best wishes,

Larissa

↧

Generate new variable from single variable with multiple values

December 6, 2016, 12:25 pm

Hi Statalist,

I have a longitudinal dataset, with ~4 million observations and 3 variables (id, year_month, code). The complete dataset covers years (2013-2016) with each observation representing the day/month/year that users accessed health services. The variable year_month is simply the year and month corresponding to the date of the visit. The variable code represents what information was accessed from the clinic - 999=individual requested information ; with other numbers representing different kinds of information.

input double id int code float year_month
255652001019 999 653
255652001019 11 653
255652001019 41 653
255652001383 71 655
255652001383 62 655
255652001383 52 655
255652001383 64 655
255652001383 999 655
255652001383 31 655
255652001383 31 655
255652001383 82 655
255652001383 61 655
255652001383 0 655
255652001383 82 655
255652001383 999 655
255652001383 79 655
255652001880 999 674
255652001880 411 675
255652001880 423 675
255652001880 421 675
255652001880 426 675
255652001880 999 675
255652001880 422 675
255652001897 32 675
255652001897 66 675
255652001897 31 675
255652001897 999 675

end
format %tm year_month

I need to create a new dichotomous variable that identifies individuals who requested certain information from clinics. I am only interested if they first requested info (code==999) and if they asked for (code==11,82,79). I'm not double counting if individuals requested more than one type, i.e., 11&82.

I have tried several approaches, including the following, but have had no luck.

Code:

bysort id year_month: gen engage_d2 = 1 if (code==11) & (code==999) | (code==82) & (code==999) | (code==79) & (code==999)
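A condition inside `gen ... if` is evaluated one observation at a time, so `(code==11) & (code==999)` can never be true: a single row cannot hold both values. A possible alternative, sketched here under the assumption that the flag should be set per id (use `bysort id year_month:` instead to require both codes within the same month), is to build one indicator per condition across the group and then combine them. The variable names `has999`, `hasinfo`, and `engage` are hypothetical, and the sketch does not check that the 999 row comes before the others.

```stata
* Subset of the example data from the post
clear
input double id int code float year_month
255652001019 999 653
255652001019 11  653
255652001019 41  653
255652001880 999 674
255652001880 411 675
255652001897 32  675
255652001897 999 675
end
format %tm year_month

* 1 for every row of an id that has at least one code 999
bysort id: egen byte has999 = max(code == 999)

* 1 for every row of an id that has at least one of 11, 82, 79
bysort id: egen byte hasinfo = max(inlist(code, 11, 82, 79))

* Both conditions met somewhere within the id; counted once
* no matter how many of 11, 82, 79 were requested
gen byte engage = has999 & hasinfo

list, sepby(id)
```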

Any help would be much appreciated. Thanks!
Patrick

↧

Generate New Variable that contains Character

December 6, 2016, 1:10 pm

Hi,

I want to generate a variable that flags the facility names containing "AAH", "AHC", "AMG", or "AURORA". My code thus far is: gen facility=1 if strpos(_FACILITYNAME, "AAH" "AHC" "AMG" "Aurora" "AURORA").

. tab _FACILITYNAME

FACILITYNAME | Freq. Percent Cum.
----------------------------------------+-----------------------------------
16th St. Community Center-4512 | 1 0.27 0.27
16th St. Community Clinic (2)-1005 | 5 1.35 1.62
16th St. Community Health Cntr-Pkwy-1.. | 4 1.08 2.70
16th St.Community Health Ctr-Chavez-4.. | 3 0.81 3.51
AAH BROOKFIELD BLUEMOUND | 2 0.54 4.05
AAH GOOD HOPE ROAD CLINIC | 1 0.27 4.32
AAH MAYFAIR ROAD CLINIC | 1 0.27 4.59
AAH MILWAUKEE WEST CLINIC | 2 0.54 5.14
AAH NEW BERLIN CLINIC | 1 0.27 5.41
AAH WOMENS CARE FRANKLIN | 1 0.27 5.68
ACL Aurora MC Grafton I/F | 1 0.27 5.95
ACL Central/West Allis Mh | 1 0.27 6.22
ACL Laboratories | 34 9.19 15.41
ACL/AMG WEST ALLIS CENTRL LAB-AMG | 1 0.27 15.68
AHC/AAH NEW BERLIN PSC-AAH | 2 0.54 16.22
AHC/AAH WA PSC - AAH | 2 0.54 16.76
AHC/AMG DE PERE | 1 0.27 17.03
AHC/AMG EDGERTON | 9 2.43 19.46
AHC/AMG EDGERTON HEALTH CTR | 9 2.43 21.89
AHC/AMG OSHKOSH-WESTHAVEN | 1 0.27 22.16
AHC/AMG RACINE EAST | 1 0.27 22.43
AHC/AMG US BANK | 1 0.27 22.70
AHC/AMG WEST ALLIS FIREHOUSE SQ | 1 0.27 22.97
AHC/AMG WILKINSON MED CLN SUMMIT | 1 0.27 23.24
AHC/AUW WALKERS POINT | 1 0.27 23.51
AHC/AUW WOMENS HEALTH CENTER | 2 0.54 24.05
AURORA ADVANCED HEALTHCARE FWC | 1 0.27 24.32
AURORA ADVANCED HEALTHCARE NB | 1 0.27 24.59
AURORA ADVANCED HEALTHCARE RD | 1 0.27 24.86
AURORA ADVANCED HEALTHCARE WCC | 1 0.27 25.14
AURORA HEALTHCARE | 7 1.89 27.03
AURORA MEDICAL CENTER GRAFTON | 2 0.54 27.57
AURORA MEDICAL GROUP | 1 0.27 27.84

Above, this is the data set I'm working with.

. tab facility

facility | Freq. Percent Cum.
------------+-----------------------------------
1 | 12 100.00 100.00
------------+-----------------------------------
Total | 12 100.00

I only get 12 values for that code when there are clearly more containing those characters mentioned above. What options can I include to capture all the values containing those characters, or am I just using the wrong code?
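For what it is worth, `strpos(s1, s2)` looks for one substring `s2` at a time, so several search strings need either several `strpos()` calls joined with `|` or a single regular expression. A minimal sketch, assuming the variable is called `facilityname` (a stand-in for `_FACILITYNAME`) and using a handful of the names from the table:

```stata
* A few facility names taken from the table above
clear
input str40 facilityname
"16th St. Community Center-4512"
"AAH BROOKFIELD BLUEMOUND"
"ACL Aurora MC Grafton I/F"
"ACL Laboratories"
"AHC/AMG DE PERE"
"AURORA HEALTHCARE"
end

* regexm() returns 1 if the pattern matches anywhere in the string;
* upper() makes the match case-insensitive, so "Aurora" is caught too
gen byte facility = regexm(upper(facilityname), "AAH|AHC|AMG|AURORA")

* Equivalent version with one strpos() call per substring
gen byte facility2 = strpos(upper(facilityname), "AAH") > 0 | ///
    strpos(upper(facilityname), "AHC") > 0 | ///
    strpos(upper(facilityname), "AMG") > 0 | ///
    strpos(upper(facilityname), "AURORA") > 0

tab facility
```

Both versions code non-matching names as 0 rather than missing, which makes the result easier to tabulate than `gen facility = 1 if ...`.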

Using Stata 14.1

Edit: Sorry, the above tables are formatted in a way that makes them hard to read.

↧

Make Windows Behave like Mac

December 6, 2016, 1:34 pm

I recently switched from the Mac version of Stata 12 MP to the Windows version of Stata 14 IC. I'm probably one of the few that doesn't like to have multiple Stata windows open at the same time. Is there any way to change my settings so that running a program doesn't open up a new stata window? The problem is that when I don't pay attention I end up having 3 or 4 stata windows open at the same time and I will occasionally write in a command on the wrong window. I usually work on multiple do files using Notepad++.

↧

HAUSMAN test for neg bin Random and Fixed effect models

December 6, 2016, 3:34 pm

Hi!

I'm using negative binomial models, and I'm fitting them as random-effects and fixed-effects models (using the command "menbreg" for the random effects and the command "nbreg" with a specification for the fixed effects).

I attached my Hausman test; the results are unclear to me: the prob>chi2 is a high negative number.

Could someone help me to interpret it?
Is it normal that results completely change if I use "hausman random fixed" instead of "hausman fixed random"?

Thank you so much in advance,
Roberta

↧

finding ID where more than one observation of varA is missing, but still listing all observations for that ID

December 6, 2016, 3:46 pm

Hello,
I wish to list all observations of all subjects where at least one observation (of variable -make-) is missing (within variable ID). Thus, in the following reproducible code, I'm seeking to write a command that would list **all** observations of ID==1 and ID==2, not just the missing ones, and then count how many per ID are missing -make-.

I am exploring my data, and wish to visually examine for "patterns of missing-ness".
I'd like to write code that does something like this: list id make (if count >0 where make=="").
On the counting step,
for ID==1, then N=2
for ID==2, then N=1

clear
input id str10 make
1
1
1 "Ford"
2
2 "Chevrolet"
2 "Chevrolet"
3 "VW"
3 "VW"
3 "VW"
end
bysort id: l, sepby(id) noo N
bysort id: count if make==""
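One way this could be done, sketched with the same example data (the empty strings are typed explicitly here, and `nmiss` is a made-up name): count the missing values of make within each id with `egen`, then list every observation of the ids where that count is positive.

```stata
* Example data from the post, with missing make entered as ""
clear
input id str10 make
1 ""
1 ""
1 "Ford"
2 ""
2 "Chevrolet"
2 "Chevrolet"
3 "VW"
3 "VW"
3 "VW"
end

* Number of observations with missing make within each id,
* repeated on every row of that id
bysort id: egen nmiss = total(missing(make))

* All observations (missing or not) of ids with at least one missing make
list id make nmiss if nmiss > 0, sepby(id) noobs
```

Because `nmiss` is constant within id, the `if nmiss > 0` condition keeps or drops whole ids rather than individual rows.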

↧

Update transition probabilities

December 6, 2016, 4:24 pm

Hello,

I am looking for a way to update the transition matrix obtained using the xttrans or xttrans2 command.
The data I am working on is a panel dataset with individuals observed across 15 years.

Using the above command, I estimated the transition probabilities of the whole sample of switching from one level of the specified variable (in this case income) to another.
This estimation is easily made taking frequencies of observations that actually switched across the years and weighting in order to sum to 1.

My intention is to find a way to exploit the panel dimension, that is, to take this transition matrix for the whole sample as the prior for each individual and then update it using the actual path followed by that individual, obtaining one transition matrix for each individual in the dataset that represents both the common prior and the specific path.

Any help with the code besides the statistical intuition would be very appreciated!

Thanks in advance,
Sincerely
Luca Gagliardone.

↧

How can I duplicate the content of the variable taking into account the ID and the year?

December 6, 2016, 5:06 pm

This is my first request, so please excuse any mistakes.

For the variable X, I have values for only one year (1977). However, I also need values for this variable for the years 1978 and 1979. Since this variable can be regarded as "quasi-constant", I would like to fill the following years with the same content.

I want to duplicate the existing data of variable X. However, I have to consider the ID and the year. How can I duplicate the content of the variable taking into account the ID and the year? The persons for whom there is no value of X are to be deleted. In addition, the same ID can occur more than once in the same year, because, for example, several people exist in the household. Do I have to consider this at all? In my opinion this is not a problem, as long as all observations with the same ID should have the same value of X in all years (1977-1979).
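A possible sketch, under several assumptions: X is numeric, the identifier and year variables are called `ID` and `year`, and every observation with the same ID should receive the same 1977 value (so several persons sharing an ID in one year is harmless only if they also share X). The helper variable `X77` and the toy data are invented for illustration.

```stata
* Toy data (values are made up): ID 1 has X in 1977, ID 2 never does
clear
input ID year X
1 1977 5
1 1978 .
1 1979 .
2 1977 .
2 1978 .
2 1979 .
end

* Copy the 1977 value of X to every observation of the same ID;
* IDs without a 1977 value get missing
egen X77 = max(cond(year == 1977, X, .)), by(ID)

* Fill the later years and drop IDs that never had a value
replace X = X77 if missing(X)
drop if missing(X77)

list, sepby(ID)
```

If persons with the same ID could have different 1977 values, `max()` would silently pick the largest one, so it may be worth checking for that before filling.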

Thank you for your support!

Ugur

↧

Probit Model with Binary endogenous regressor

December 7, 2016, 7:14 am

Dear Statalist,

Can anyone help me with the instrumental-variable probit model? My data are cross-sectional.

I have a binary dependent variable (y), a set of exogenous variables, and one endogenous variable (x), which is also binary. I want to use instrumental variables to test and correct for this endogeneity. I am using Stata 12 and the command ivprobit, so what I write is the following:

ivprobit Export Size Age Legal Regions Sector Sites_1 (I_A_Formal = Interaction)

Is this correct? Can someone point out what comes next and how to interpret the results, please? And should I use a Hausman test to test for endogeneity, and if so, how can I do this with cross-sectional data?

Thank you so much in advance,

↧