Channel: Statalist

Calculating an optimal level of debt ratio

Hello,
I am conducting a study of the relationship between debt ratio and return on assets in Swedish real estate companies. I'm trying to find either the optimal level of the debt ratio or an interval that contains it. The code I ran calculates the optimal level based on data from 2013. My debt ratio variable is Kortfrist_sku and return on assets is ROA.

The code I've used is the following:

// Step 1: Filter the Data for the Year 2013
qui keep if År == 2013

// Step 2: Run a Regression Analysis
regress ROA Kortfrist_sku

// Step 3: Calculate the Optimal Value
predict ROA_hat
summarize Kortfrist_sku, detail
local max_ROA = max(ROA_hat)
qui replace Kortfrist_sku = . if ROA_hat != `max_ROA'

// Step 4: Display the Optimal Value
di "The optimal level of Kortfrist_sku for maximum ROA in 2013: " Kortfrist_sku

I am wondering whether this code is correct. I am getting a value that seems plausible (0,04125). Is there any way to confirm this, or to generate an interval around the value?
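For reference, a straight line in Kortfrist_sku has no interior maximum, so one common way to estimate an optimal level is a quadratic specification, where the turning point -b1/(2*b2) is the candidate optimum. A sketch, assuming the variable names above:

Code:
regress ROA c.Kortfrist_sku##c.Kortfrist_sku if År == 2013
* turning point of the fitted parabola, with a delta-method CI
nlcom -_b[Kortfrist_sku] / (2*_b[c.Kortfrist_sku#c.Kortfrist_sku])

nlcom also reports a standard error and confidence interval, which is the kind of interval asked about.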

Thanks in advance,
Vincent

Generating new variables for partner?

Dear stata community,

I am fairly new to Stata and am trying to wrap my head around creating new variables from the content of a partner's observation.
I have a wide dataset where each observation is a single individual. Let's say it looks like the following.

ID PartnerID Var1 Var2
1 5 “hi” “ho”
2 3 “cat” “flower”
3 2 “bird” “stone”
4 . “Frog” “cycle”
5 1 “Jupiter” “lollipop”


Now I am attempting to attach each partner's variables as new variables on the index person's observation, like this:

ID PartnerID Var1 PartnerVar1 Var2 PartnerVar2
1 5 “hi” “Jupiter” “ho” “lollipop“
2 3 “cat” “bird” “flower” “stone”
3 2 “bird” “cat” “stone” “flower”
4 . “Frog” . “cycle” ""
5 1 “Jupiter” “hi” “lollipop” “ho”


The following syntax worked fine initially:
gen PartnerVar1 = Var1[PartnerID]

Yet it depends on the ID variable being an unbroken sequence. If the sequence has gaps (e.g. 1 2 3 5 6 7 8 10 15), there will be a mismatch.

Do any of you have suggestions on how to match not by row number but by the content of PartnerID? Preferably without a foreach/forvalues loop, as there are approx. 70 variables and 1,000,000 observations.
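For reference, one loop-free pattern is a self-merge: save a copy of the data keyed by ID with the variables renamed, then merge it back on PartnerID. A sketch, using the example variables above (the wildcard rename generalizes to many variables at once):

Code:
preserve
keep ID Var1 Var2
rename Var* PartnerVar*          // renames the whole group in one step
rename ID PartnerID
tempfile partners
save `partners'
restore
merge m:1 PartnerID using `partners', keep(master match) nogenerate

Observations with a missing PartnerID simply stay unmatched, which gives the missing PartnerVar values in row 4 of the desired output.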

Kind regards,
Joel

Eliminate non-consecutive years in panel data

Greetings,

I'm working with a dataset that runs from 2010 to 2022. I have a panel of firms and years. I'm cleaning the data, which contains some missing values of the dependent variable, and I used the following code to keep the firms with at least 5 consecutive years:

gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)
by id: egen maxrun = max(run)
drop if maxrun < 5

Now, some firms have breaks in the years of the dependent variable: for example, 6 consecutive years (2010-2015) and then 2 consecutive years (2018-2019).
I want to eliminate any run of fewer than 5 consecutive years per company, but I can't find code to do it (in the example above, I just want to drop the second run, 2018-2019).
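For reference, one way to isolate runs of consecutive years is to number each spell and then drop the short spells. A sketch, assuming the panel is identified by id and year and that observations with a missing dependent variable have already been dropped:

Code:
bysort id (year): gen spell = sum(_n == 1 | year != year[_n-1] + 1)
bysort id spell (year): gen spell_len = _N
drop if spell_len < 5

Each gap in year starts a new spell, so the 2018-2019 run in the example becomes its own spell of length 2 and is dropped.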

Any suggestions?
Thank you
Nuno

Interpreting an IRT graded response model

Hello,
Sorry for the basic question, but I was wondering which is the most important statistic when interpreting an IRT graded response model. In the attached output there are five survey items in the model (responses on each item range from 1-5 on an ordinal scale). If I want to determine the one survey item that best predicts the others, which statistic is most relevant: the discrimination (Discrim) or the difficulty (Diff) coefficients?
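For reference, a graded response model can be fitted and its item parameters listed side by side; the discrimination parameter is the usual measure of how strongly an item relates to the latent trait. A sketch, with hypothetical item names item1-item5:

Code:
irt grm item1-item5
estat report, byparm      // arranges output by parameter for easy comparison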
Many thanks for reading,
Tom

Calculating QoL scores (qlq-c30)

I am using the quality of life (QoL) instrument EORTC QLQ-C30. I am trying to score it based on the official scoring manual for this instrument, which provides user-written Stata code. I have attached the Stata portion of the scoring manual, which can also be found at the link below (or google "eortc qlq-c30 scoring manual" if not feeling like following others' links); please go to page 68.
https://www.eortc.org/app/uploads/si...2/SCmanual.pdf

When I finally run the command qlqscal 3, Stata gives me no error message, but as far as I can see nothing happens (no new variables are created).

I must be doing something wrong here - it is my first time working with ado files.

What I have done:
1. Copy-pasted the entire page 70 into the Do-file Editor, saving it as qlqsub.ado in my personal folder (documents/stata/ado/personal)

2. Copy-pasted the entire page 71 into the Do-file Editor, saving it as qlqscal.ado in my personal folder (documents/stata/ado/personal)

3. Copy-pasted the second column of page 73 and the entire page 74 into the Do-file Editor, saving it as qlqlabl.ado in my personal folder (documents/stata/ado/personal)

#1-3 is attached

4. Loaded the dataset with the questionnaire data (as provided in the dataex below)
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int record_id byte(q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20 q21 q22 q23 q24 q25 q26 q27 q28 q29 q30)
403 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 5 5
404 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 7
406 3 3 1 1 1 2 2 1 3 2 1 2 2 1 1 1 2 2 3 2 2 3 1 1 2 2 1 1 5 5
416 3 4 3 4 1 2 2 1 4 3 4 2 4 2 2 3 3 3 4 1 1 1 1 2 1 2 2 1 2 2
421 4 3 1 2 1 3 2 1 1 1 1 1 1 1 1 4 3 2 1 1 1 1 1 1 3 1 1 1 6 4
424 1 1 1 1 1 2 2 1 2 2 2 1 1 2 1 1 2 2 2 1 3 3 1 2 1 1 1 1 3 3
428 1 2 1 1 1 3 1 1 2 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 4 4
432 3 3 2 1 1 2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 4 1 2 1 1 1 1 6 6
435 2 3 1 2 1 3 2 2 2 2 4 3 3 2 1 2 2 2 1 2 3 3 2 3 2 1 2 1 4 4
446 3 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 6 6
end

5. Ran the following code
Code:
personal dir

qlqscal 3
As I understand the scoring manual (p. 68), lots of new variables should be created. In my situation the qlqscal 3 command seems to be accepted, but nothing comes of it.
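For reference, a few commands can confirm whether Stata is actually finding and running the new ado-files — a sketch:

Code:
adopath          // PERSONAL should point at documents/stata/ado/personal
which qlqscal    // shows which qlqscal.ado Stata will run, or errors if none is found
discard          // clears any cached copy after an .ado file has been edited
qlqscal 3
describe         // check whether the scale variables were created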

I would be very grateful if anyone can steer me in the right direction.

Reghdfe

Hi everyone,
I have a question.
My dependent variable is a score ranging from 0 to 100; it is continuous. I want to include both country and industry fixed effects.
Could I use a reghdfe regression?
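For reference, a minimal sketch of such a model (reghdfe is a community-contributed command from SSC; score, x1, and x2 are hypothetical variable names):

Code:
ssc install reghdfe
reghdfe score x1 x2, absorb(country industry) vce(cluster country)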

Different results between the xtivreg first stage and xtreg run on only the first stage

Dear all,

I am currently running a regression with -xtivreg- to study a long-term effect with IV fixed effects and a logged model. However, I find that the first-stage results reported by xtivreg differ from those of running only the first stage with xtreg. The results are:
xtivreg disease L1.age L1.age2 L1.inter1 L1.inter2 L1.marital (L1.x = L1.indicator) i.wave if L1.age>=55 & L1.age<=75 & labor_force==1, first fe vce(r)

First-stage within regression

Fixed-effects (within) regression Number of obs = 13,607
Group variable: newid Number of groups = 9,824

R-squared: Obs per group:
Within = 0.1613 min = 1
Between = 0.0201 avg = 1.4
Overall = 0.0068 max = 2

F(7,9823) = 55.07
corr(u_i, Xb) = -0.3467 Prob > F = 0.0000

(Std. err. adjusted for 9,824 clusters in newid)
------------------------------------------------------------------------------
| Robust
__000004 | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------

L1.age | .0313144 .0170861 1.83 0.067 -.0021778 .0648066
|
L1.age2 | .0042818 .001345 3.18 0.001 .0016453 .0069182
|
L1. inter1| -.0517112 .0138465 -3.73 0.000 -.0788532 -.0245691
|
L1.inter2 | -.0072432 .0014281 -5.07 0.000 -.0100425 -.0044438
|
L1. marital| -.0054045 .035118 -0.15 0.878 -.074243 .063434
|
wave |
2 | 0 (empty)
5 | 0 (empty)
6 | -.1181368 .0220845 -5.35 0.000 -.1614269 -.0748466
7 | 0 (omitted)
|
L1.indicator | .1362696 .0235653 5.78 0.000 .0900768 .1824625
|
_cons | .843435 .047912 17.60 0.000 .7495177 .9373523
-------------+----------------------------------------------------------------
sigma_u | .48187339
sigma_e | .18851021
rho | .86727289 (fraction of variance due to u_i)
------------------------------------------------------------------------------

Fixed-effects (within) IV regression Number of obs = 41,102
Group variable: newid Number of groups = 24,212

R-squared: Obs per group:
Within = 0.0580 min = 1
Between = 0.0105 avg = 1.7
Overall = 0.0168 max = 4


Wald chi2(9) = 6035.37
corr(u_i, Xb) = 0.0066 Prob > chi2 = 0.0000

(Std. err. adjusted for 24,212 clusters in newid)
------------------------------------------------------------------------------
| Robust
disease | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
x |
L1. | .0696582 .03052 2.28 0.022 .0098401 .1294764
|
age |
L1. | -.0023588 .00477 -0.49 0.621 -.0117078 .0069903
|
age2 |
L1. | .0001673 .0003024 0.55 0.580 -.0004254 .0007599
|
inter1 |
L1. | -.0002404 .003623 -0.07 0.947 -.0073413 .0068605
|
inter2 |
L1. | .0009104 .0005197 1.75 0.080 -.0001083 .0019291
|
marital |
L1. | -.0050013 .0111639 -0.45 0.654 -.0268821 .0168796
|
wave |
5 | .0400923 .0202437 1.98 0.048 .0004152 .0797693
6 | .0575173 .0257137 2.24 0.025 .0071193 .1079152
7 | .0726076 .0313591 2.32 0.021 .0111449 .1340703
|
_cons | .0428386 .0272541 1.57 0.116 -.0105785 .0962558
-------------+----------------------------------------------------------------
sigma_u | .3393069
sigma_e | .14958866
rho | .8372669 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented: L.x
Instruments: L.age L.age2 L.inter1 L.inter2 L.marital 5.wave
6.wave 7.wave L.indicator





If I run only the first stage with xtreg, the result is the following:
xtreg L1.x L1.age L1.age2 L1.inter1 L1.inter2 L1.marital L1.indicator i.wave if L1.age>=55 & L1.age<=75 & labor_force==1,fe vce(r)

Fixed-effects (within) regression Number of obs = 41,124
Group variable: newid Number of groups = 24,222

R-squared: Obs per group:
Within = 0.4045 min = 1
Between = 0.4217 avg = 1.7
Overall = 0.4143 max = 4

F(9,24221) = 669.45
corr(u_i, Xb) = 0.2598 Prob > F = 0.0000

(Std. err. adjusted for 24,222 clusters in newid)
------------------------------------------------------------------------------
| Robust
L.x | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age |
L1. | .0655449 .0070952 9.24 0.000 .0516379 .0794519
|
age2 |
L1. | .0037481 .0004888 7.67 0.000 .00279 .0047062
|
inter1 |
L1. | -.0500509 .0062653 -7.99 0.000 -.0623313 -.0377705
|
inter2 |
L1. | -.0067018 .0005848 -11.46 0.000 -.0078481 -.0055555
|
marital |
L1. | .0092198 .0173753 0.53 0.596 -.0248368 .0432764
|
indicator |
L1. | .1974026 .0129819 15.21 0.000 .1719572 .2228481
|
wave |
5 | .1459084 .0324698 4.49 0.000 .0822656 .2095512
6 | .2091353 .0407499 5.13 0.000 .129263 .2890076
7 | .2750966 .0494694 5.56 0.000 .1781335 .3720597
|
_cons | .4403559 .0359106 12.26 0.000 .3699688 .510743
-------------+----------------------------------------------------------------
sigma_u | .36985621
sigma_e | .25082088
rho | .68497935 (fraction of variance due to u_i)
------------------------------------------------------------------------------



I really cannot figure out why the sample size changes that much and why the results are different. I would appreciate it if anyone could help me with this. Thank you so much!
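For reference, one way to see exactly which observations each command uses is to flag the estimation samples with e(sample) and cross-tabulate them — a sketch, reusing the specifications above:

Code:
quietly xtivreg disease L1.age L1.age2 L1.inter1 L1.inter2 L1.marital ///
    (L1.x = L1.indicator) i.wave ///
    if L1.age>=55 & L1.age<=75 & labor_force==1, fe vce(r)
gen byte in_iv = e(sample)
quietly xtreg L1.x L1.age L1.age2 L1.inter1 L1.inter2 L1.marital ///
    L1.indicator i.wave ///
    if L1.age>=55 & L1.age<=75 & labor_force==1, fe vce(r)
gen byte in_fs = e(sample)
tabulate in_iv in_fs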


Issue when plotting confidence interval plots

Dear all,

I'm trying to compare 2 different confidence interval plots of the growth rate of 2 diseases in 2 different image modalities and am having some issues when combining these interval plots.
When calculating the mean and confidence interval of the data for one of the diseases (GA), I get a mean of ~1.66 with a 95% CI of 1.496 to 1.834 for one of the image modalities (FAF). When I put this in an interval plot, everything looks right according to the means and confidence intervals (first picture, first CI). However, when I try to combine the interval plots of the two diseases, the mean and CI of disease GA in modality FAF are suddenly completely different (second picture, first plot). I don't know why this is happening or how I can fix it. Could someone help me, please?

Thank you!


Customizing table labels with 'collect label levels results' works as desired with summary statistics but not with ratio statistics

Hello.

I am using Stata 18 on macOS 10.15.7. The collect command is supposed to allow modification of labels without modifying the variable labels, as described in the Stata blog post Customizing Tables, part 2.

Following the commands from the blog, which generate summary (mean, sd) and ratio (percent) statistics, I am unable to recreate the desired modified table labels where only the modified labels appear. Instead, I get the modified label plus the variable label (from the dataset).
However, if I restrict the code to summary statistics such as mean and SD, the customized labelling works. Once I add ratio statistics like proportion or percent, the customized labels AND the variable labels both appear.

I am not sure if I am using collect incorrectly or it is not responding as it should.

This is the code from the blog, slightly modified.

To create the table :

Code:
webuse nhanes2l , clear
collect clear
collect dims

table (sex) (highbp), ///
statistic(frequency) ///
statistic(percent) ///
statistic(mean age) ///
statistic(sd age) ///
nototals
This is the table as generated
Code:
-----------------------------------------------
                       |   High blood pressure
                       |          0           1
-----------------------+-----------------------
Sex                    |                       
  Male                 |                       
    Frequency          |      2,611       2,304
    Percent            |      25.22       22.26
    Mean               |                       
      Age (years)      |    42.8625    52.59288
    Standard deviation |                       
      Age (years)      |    16.9688    15.88326
  Female               |                       
    Frequency          |      3,364       2,072
    Percent            |      32.50       20.02
    Mean               |                       
      Age (years)      |   41.62366    57.61921
    Standard deviation |                       
      Age (years)      |   16.59921    13.25577
-----------------------------------------------
I can modify the labels by :

Code:
collect label list result, all
collect label levels result frequency "Freq." ///
    mean      "Mean (Age)" ///
    percent   "Percent" ///
    sd         "SD (Age)" ///
    , modify
This shows the desired labels correctly :

Code:
collect label list result, all
The resulting table is not as expected, because the labels become the collect label plus the variable label: "Mean (Age) Age (years)" and "SD (Age) Age (years)".

Code:
. collect label list result

  Collection: Table
   Dimension: result
       Label: Result
Level labels:
        mean  Mean (Age)
     percent  Percent
          sd  SD (Age)

. collect preview

------------------------------------------
                  |   High blood pressure
                  |          0           1
------------------+-----------------------
Sex               |                       
  Male            |                       
    Percent       |      25.22       22.26
    Mean (Age)    |                       
      Age (years) |    42.8625    52.59288
    SD (Age)      |                       
      Age (years) |    16.9688    15.88326
  Female          |                       
    Percent       |      32.50       20.02
    Mean (Age)    |                       
      Age (years) |   41.62366    57.61921
    SD (Age)      |                       
      Age (years) |   16.59921    13.25577
------------------------------------------
The variable label is seen here

Code:
describe age

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------------------------------------------
age             byte    %9.0g                 Age (years)
If I restrict the code to only summary statistics and not ratio statistics, I get the desired labels : Mean (Age) and SD (Age)

Code:
webuse nhanes2l , clear
collect clear
collect dims

qui table (sex) (highbp), ///
    statistic(mean age) ///
    statistic(sd age) ///
    nototals

collect label levels result ///
    mean      "Mean (Age)" ///
    sd         "SD (Age)" , modify
The labels are correctly modified and the resulting table has the correct labels.

Code:
. collect label list result
  Collection: Table
   Dimension: result
       Label: Result
Level labels:
        mean  Mean (Age)
          sd  SD (Age)

. collect preview

---------------------------------------
               |   High blood pressure
               |          0           1
---------------+-----------------------
Sex            |                       
  Male         |                       
    Mean (Age) |    42.8625    52.59288
    SD (Age)   |    16.9688    15.88326
  Female       |                       
    Mean (Age) |   41.62366    57.61921
    SD (Age)   |   16.59921    13.25577
---------------------------------------
I can recreate the problem by adding any ratio statistic, this time statistic(proportion).
The combined collect label plus variable label, "Mean (Age) Age (years)" and "SD (Age) Age (years)", appear in the table again.

Code:
webuse nhanes2l , clear
collect clear
collect dims

qui table (sex) (highbp), ///
    statistic(proportion) /// /*THIS WAS ADDED*/
    statistic(mean age) ///
    statistic(sd age) ///
    nototals

collect label levels result ///
    mean      "Mean (Age)" ///
    sd         "SD (Age)" , modify
You can see that the labels are modified correctly.
Code:
collect label list result

  Collection: Table
   Dimension: result
       Label: Result
Level labels:
        mean  Mean (Age)
  proportion  Proportion
          sd  SD (Age)
However, the table once again has both the collect labels and the appended variable labels.

Code:
collect preview

------------------------------------------
                  |   High blood pressure
                  |          0           1
------------------+-----------------------
Sex               |                       
  Male            |                       
    Proportion    |      .2522       .2226
    Mean (Age)    |                       
      Age (years) |    42.8625    52.59288
    SD (Age)      |                       
      Age (years) |    16.9688    15.88326
  Female          |                       
    Proportion    |       .325       .2002
    Mean (Age)    |                       
      Age (years) |   41.62366    57.61921
    SD (Age)      |                       
      Age (years) |   16.59921    13.25577
------------------------------------------
I don't see why it should be behaving differently depending on type of statistic requested. Is there a way to get the desired labels only using collect without the appended variable labels?

Visualizing data using a bar chart?

Hello! I want to present some data using a bar chart. The statistics of interest is the mean of a binary variable (Mort30d), and I want to present it over a grouping variable (EDLOSGroup).
The correct statistics are obtained using the following command:
graph bar (mean) Mort30d, over(EDLOSGroup)
However, I would like each individual bar to include information about another grouping variable (TriageLevel) using different colors. I have searched this forum and YouTube, and even spent over an hour with ChatGPT trying to solve this problem. The end result should look something like the attached chart.

On the y axis we have percent of 30-day mortality. On the x axis the groups (EDLOSGroup) and the colors represent the 6 possible values for TriageLevel.
The above graph was made in Excel by manually calculating the proportions of the different segments of each bar. All my attempts to achieve this in Stata have either resulted in separate bars for the secondary grouping variable, or in incorrect results like the attached one, produced by:
graph bar (mean) Mort30d, over(TriageLevel) over(EDLOSGroup) stack asyvars

The problem is that I want the bar height to be based only on the mean of Mort30d; the colors within each bar should represent the distribution of TriageLevel, but only for the part of the dataset that the bar represents. I hope the images make it somewhat clear what I'm trying to accomplish.

Continuous line cumulative frequency graph

Hello Forum,

I'm creating a cumulative frequency graph, but the visual result is not what I expected. How do I create a continuous line without these “outgrowths”?
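For reference, a smooth cumulative frequency line usually comes from cumul followed by a sorted line plot — a sketch, assuming a hypothetical numeric variable x:

Code:
cumul x, gen(cf)
sort x
line cf x, sort

Jagged "outgrowths" typically appear when the points are connected in data order rather than in order of x.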

I appreciate any help

Ben


[attached graph]

Complicated loop with sizes of households and income

Hello,

I am trying to calculate the number of people who are in a low-income family, under the low income cut-off. The low income cut-off (LICO in Canada) varies by year and by number of people in the family. I have 20 years in my data; a tiny sample exported using dataex looks like this:


Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int taxyear str5 id float(fam_inc fam_size)
2002 "M0311" 38855 4
2002 "S3172" 38855 4
2016 "M2263" 28061 3
2016 "M2494" 26052 9
2016 "S9925" 26052 3
2017 "M2966"  7104 3
2017 "M3977"  1164 2
2017 "M6388" 33560 4
2018 "O9499" 32551 4
2018 "M6311" 32551 4
end

I want to put the following code into a loop, dropping claimants who are above the low income cut-off, so that I don't have to copy and paste it 20 times and write out each cut-off for each family size (2 to 10 family members):


* 2016
drop if fam_size ==2 & taxyear==2016 & fam_inc > 32084
drop if fam_size ==3 & taxyear==2016 & fam_inc > 39295
drop if fam_size ==4 & taxyear==2016 & fam_inc > 45374
drop if fam_size ==5 & taxyear==2016 & fam_inc > 50730
drop if fam_size ==6 & taxyear==2016 & fam_inc > 55572
drop if fam_size ==7 & taxyear==2016 & fam_inc > 60024
drop if fam_size ==8 & taxyear==2016 & fam_inc > 64169
drop if fam_size ==9 & taxyear==2016 & fam_inc > 68061
drop if fam_size ==10 & taxyear==2016 & fam_inc > 71743

* 2017
drop if fam_size ==2 & taxyear==2017 & fam_inc > 33076
drop if fam_size ==3 & taxyear==2017 & fam_inc > 40509
drop if fam_size ==4 & taxyear==2017 & fam_inc > 46776
drop if fam_size ==5 & taxyear==2017 & fam_inc > 52297
drop if fam_size ==6 & taxyear==2017 & fam_inc > 57289
drop if fam_size ==7 & taxyear==2017 & fam_inc > 61879
drop if fam_size ==8 & taxyear==2017 & fam_inc > 66151
drop if fam_size ==9 & taxyear==2017 & fam_inc > 70164
drop if fam_size ==10 & taxyear==2017 & fam_inc > 73959

* ETC FOR 20 years

I have the 20 years of family income cut-offs for each family size. Here is a sample of the table of thresholds (family income by family size) below which a person falls under the low family income threshold:
Household size   2 persons  3 persons  4 persons  5 persons  6 persons  7 persons  8 persons  9 persons  10 persons
2007             25066      30699      35448      39632      43415      46893      50131      53172      56048
2008             26490      32443      37462      41884      45881      49558      52979      56193      59233
2009             26695      32694      37752      42208      46237      49941      53389      56628      59691
2010             27208      33323      38478      43020      47126      50902      54416      57717      60839
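For reference, a merge avoids the loop entirely: store the cut-offs as their own dataset with one row per year and family size, then merge them in. A sketch, assuming a hypothetical file lico.dta with variables taxyear, fam_size, and cutoff:

Code:
merge m:1 taxyear fam_size using lico, keep(master match) nogenerate
drop if fam_inc > cutoff & !missing(cutoff)

The threshold table above only needs to be reshaped once into that long lico.dta layout.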


Any help would be appreciated! Thank you!


What is a "unit change" in a regressor that is a continuous index with values between 0 and 1

Hello Statalist!

I am working on a project measuring employment precarity as a multidimensional index using the Alkire-Foster methodology. Each dimension has a weight w, and each dimension is made up of one or more indicators I. The indicators are dummies, equal to 1 if the individual is deprived in that indicator and 0 otherwise. Assuming there are 3 indicators, the multidimensional index is computed as

index = w1*I1 + w2*I2 + w3*I3

This means that the index lies between 0 and 1. In terms of my study, 0 means no employment precarity, and 1 means the individual is deprived in all dimensions of precarity (which is rarely the case). My question is this: if I want to interpret the coefficient on index in a regression, would it be right to say that a "unit increase" (from 0 to 1) in the index is associated with an x-unit change in the outcome?

I am a bit confused because the index is continuous, and it may not be appropriate to consider a move from one end of the spectrum to another as a unit change.
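For reference, one common workaround is to rescale the index so that a "unit" is a meaningful step rather than the whole range — a sketch, with hypothetical variable names outcome and index:

Code:
gen index10 = index * 10
regress outcome index10    // coefficient = change in outcome per 0.1 increase in index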

Thanks for your help!

How to calculate sample size in experiment w. two-level independent and moderating variables

Hi!
I'm conducting a survey-experiment, and I would like to calculate the required sample size, but I'm unsure how to do that.

I will just provide some brief context here.

First, 50% of all respondents get treatment A (an intervention) and the other 50% get no intervention (control). The respondents are then presented with three cases that can be either male or female, and are asked to rate the three cases on a scale of 1-3 (dependent variable). I can thus create 6 observations per respondent in terms of their rankings of case A vs. B, A vs. C, and B vs. C (these will be clustered in the regression analysis).
In sum, I'm running an analysis with two levels on the independent variable (male vs. female) and two levels on the moderating variable (intervention or no intervention). The dependent variable will also be binary, comparing the rankings of the cases.

I want my significance level to be 0.05 and power to be 0.8. I don't know the effect size or variance yet, but I expect both to be small.

Can Stata help me calculate the sample size I need for this experiment, and if so, how? (I have seen videos on calculating sample sizes for, e.g., one-way analysis, but I haven't found any examples that fit my concrete case.)
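For reference, Stata's power command handles standard designs; a clustered, moderated design doesn't map onto it directly, but a two-means calculation with a small standardized effect gives a rough starting point. A sketch, assuming a hypothetical effect size of 0.2 SD:

Code:
power twomeans 0 0.2, sd(1) alpha(0.05) power(0.8)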

Thanks in advance!

/Amalie

Statistical significance different: predicted probability vs average marginal effect differences

I am currently writing up some of my findings (based on multiply imputed data) and have noticed some discrepancies in the statistical significance I find:

When using the below code to calculate differences in average marginal effects of two binary variables, I do not find a statistically significant difference...
Code:
mimrgns x1, dydx(x2) predict(pr) pwcompare
...but I do find one when calculating differences in predicted probabilities (even when looking at 99% CIs) using the below code:
Code:
mimrgns i.x1##i.x2, predict(pr) pwcompare
To add to this, I do find a statistically significant difference when looking at the log-odds in the regression table.

I am therefore wondering which indicator of statistical significance I should focus on; i.e. can I conclude there are statistically significant group differences based on the comparison of predicted probabilities (and the log-odds), or should I conclude I do not find statistically significant differences based on the AMEs?

"Use in" error within "while" loop

Hi all,

I'm experiencing a curious problem while looping through a particularly large dataset. I'm trying to compress and clean the data a million observations at a time, to stay below my computer's memory capacity.

My first iteration (when i = 1 and interval_start, interval_end are 1 and 1000000, respectively) works fine, but when the loop starts again I get an error stating "using required". Why does it work the first time but not the second? I know it successfully completes the second compress of the first iteration and saves the first dataset, as this is the output I get:
" variable concepts10 was str72 now str69
variable concepts11 was str73 now str72
variable concepts15 was str108 now str72
variable concepts17 was str108 now str72
variable concepts20 was str74 now str72
variable concepts23 was str85 now str72
variable concepts24 was str85 now str75
variable concepts25 was str73 now str72
variable concepts26 was str43 now str41
variable concepts27 was str36 now str34
variable concepts28 was str19 now str1
(82,895,562 bytes saved)
file OpenAlex_pull_p1.dta saved


1000001
2000000
2

using required "


The code is included below, as well as a visual example of my data. (I'm sorry it's not in a good format - dataex was giving me a "data width (579 chars) exceeds max linesize. Try specifying fewer variables" error. The exact nature of the data is also less material than the nature of the code.)

I'd appreciate any help or advice you could give!


cd $pull_data
describe using OpenAlex_pull
local num_obs = `r(N)'
display `num_obs'
local interval_start = 1
local interval_end = 1000000
local done = 0
local i = 1
while `done' != 1 {
    display `interval_start'
    display `interval_end'
    clear all
    display `i'
    use in `interval_start'/`interval_end' using OpenAlex_pull, clear
    capture drop multiple_concepts
    local interval_start `interval_end' + 1
    display `interval_start'
    local interval_end `interval_end' + 1000000
    display `interval_end'
    compress
    split concepts, p(",")
    des, short
    local n_vars `r(k)'
    local n_concept_vars = `n_vars' - 12
    gen keep = 0
    forvalues j = 1/`n_concept_vars' {
        display `j'
        replace keep = 1 if substr(concepts`j', 1, 7) == "Physics" & (substr(concepts`j', -4, 1) == "9" | substr(concepts`j', -5, 1) == "1") // Identify those obs which have a Physics rating of 90-99% or 100%, respectively
    }
    drop if keep != 1
    cd $pull_data
    compress
    save OpenAlex_pull_p`i', replace
    local i `i' + 1
    if `interval_end' > `num_obs' {
        local done = 1
    }
}
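For reference, note that `local x = exp` (with an equals sign) evaluates the expression, while `local x `y' + 1` stores the literal text — so on the second pass interval_start would expand to the text "1000000 + 1" inside the use range, which can break the parsing of `use in ... using`. A sketch of the counter updates with explicit evaluation:

Code:
local interval_start = `interval_end' + 1
local interval_end   = `interval_end' + 1000000
local i              = `i' + 1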

search_name concepts
A JORISSEN Physics/0/94.4,Astronomy/1/87.9,Astrophysics/1/87.5,Computer science/0/85.2,Computer vision/1/77.9,Stars/2/77.7,Quantum mechanics/1/65.6,Mathematics/0/38.5,Spectral line/2/31.7,Galaxy/2/24.4,Biology/0/22.5,Binary number/2/21.9,Arithmetic/1/21.9,Chemistry/0/20.8,Geography/0/20.2,

A JORISSEN Art/0/29.8,Philosophy/0/26.3,History/0/21.1,Physics/0/21.1,

A JORISSEN Physics/0/100.0,Astronomy/1/50.0,Geometry/1/50.0,Combinatorial chemistry/1/50.0,Theology/1/50.0,Mathematics/0/50.0,Biochemistry/1/50.0,Quantum mechanics/1/50.0,Stereochemistry/1/50.0,Biology/0/50.0,History/0/50.0,Thermodynamics/1/50.0,Mathematical analysis/1/50.0,Philosophy/0/50.0,Art/0/50.0,Medicinal chemistry/1/50.0,Multiplicity (mathematics)/2/50.0,Catalysis/2/50.0,Archaeology/1/50.0,Component (thermodynamics)/2/50.0,Dipole/2/50.0,Organic chemistry/1/50.0,Chemistry/0/50.0,Geography/0/50.0,Treasure/2/50.0,

A JORISSEN Astronomy/1/100.0,Computer vision/1/100.0,Computer science/0/100.0,Astrophysics/1/100.0,Quantum mechanics/1/100.0,Physics/0/100.0,Supernova/2/100.0,Stars/2/100.0,Asymptotic giant branch/3/100.0,Nucleosynthesis/3/50.0,Spectral line/2/50.0,Orbital period/3/50.0,Mathematics/0/50.0,Binary number/2/50.0,Giant star/3/50.0,Arithmetic/1/50.0,Galaxy/2/50.0,Metallicity/3/50.0,Stellar evolution/3/50.0,s-process/4/50.0,Binary system/3/50.0,Atomic physics/1/50.0,Nuclear physics/1/50.0,Neutron star/2/50.0,White dwarf/3/50.0,

A JORISSEN Physics/0/91.7,Mechanics/1/83.3,Engineering/0/83.3,Mathematics/0/75.0,Mechanical engineering/1/66.7,Geometry/1/58.3,Geology/0/58.3,Materials science/0/58.3,Flow (mathematics)/2/50.0,Thermodynamics/1/50.0,Geomorphology/1/50.0,Aerospace engineering/1/50.0,Venturi effect/3/41.7,Computer science/0/41.7,Nozzle/2/41.7,Oceanography/1/41.7,Discharge coefficient/3/41.7,Inlet/2/41.7,Biology/0/33.3,Meteorology/1/33.3,Composite material/1/33.3,Economics/0/33.3,Reynolds number/3/33.3,Turbulence/2/33.3,Geography/0/33.3,

A JORISSEN Computer science/0/100.0,Astronomy/1/50.0,Information retrieval/1/50.0,Computer vision/1/50.0,Mathematics/0/50.0,Environmental science/0/50.0,Astrophysics/1/50.0,Quantum mechanics/1/50.0,Galaxy/2/50.0,Statistics/1/50.0,Physics/0/50.0,Mathematical analysis/1/50.0,Stars/2/50.0,Survey data collection/2/50.0,Milky Way/3/50.0,Content (measure theory)/2/50.0,

A JORISSEN Computer science/0/100.0,Quantum mechanics/1/100.0,Physics/0/100.0,Astronomy/1/50.0,Remote sensing/1/50.0,Thermodynamics/1/50.0,Optics/1/50.0,Geology/0/50.0,Interferometry/2/50.0,Component (thermodynamics)/2/50.0,Geography/0/50.0,

Merging panel data with non-panel data

Is it possible to merge panel data (multiple observations per individual) with data that isn't panel data (only one observation per individual)?
I have tried to do so, but I get an error message saying that my identifier doesn't uniquely identify observations.
So is it possible to merge these two types of data?
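If each individual appears many times in the panel file but only once in the other file, this is a many-to-one merge, and merge m:1 (rather than merge 1:1) is the right tool. A minimal sketch, where id and the filenames are placeholders for your own identifier and files:

Code:
use panel_data, clear            // many rows per id
merge m:1 id using person_level  // exactly one row per id in the using file
tab _merge                       // check how the two files matched

The "does not uniquely identify observations" error is what merge 1:1 raises when one side repeats the key; m:1 tells Stata that repetition on the master (panel) side is expected.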

Problem with copying variables in Browse Mode on Stata/BE 18.0

Dear Statalist community,

I recently started using Stata/BE 18.0 after previously using Stata 16, and I've noticed a small difference in Browse mode.
In Stata 16, I could highlight the variables I was interested in from the "Variables" window in Browse mode and copy them with the Ctrl+C shortcut.
In Stata 18, however, doing so copies the highlighted observation(s) instead of the variable(s).
To get the old behaviour, I have to right-click with the mouse and select "Copy varlist", even though the shortcut is still displayed in the dropdown.
I've looked around the program for a way to toggle this behaviour but haven't found one.
If anyone has a solution or suggestion for this quality-of-life issue, it would be greatly appreciated.

Thank you in advance,
Pin-Yen

Multilevel modeling with cross-sectional survey data and vce(brr)

Hello, I am using the Household Pulse Survey data (US Census Bureau) to analyze food insecurity trends in states with grocery taxes during the pandemic. I have created the dataset and weighted it following the guidance of the US Census Bureau's office of data management.
I would like to perform a multi-level analysis of this data to account for household and contextual (state-level) characteristics using the melogit command, on a subpopulation of the dataset that does not include SNAP recipients.

Here is an example of my dataset only including variables of interest:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte cconfINsuf float(reweek rstate) byte rrace2
0 18 0 1
0 18 1 1
1 18 1 3
1 18 1 1
1 18 1 2
1 18 1 1
0 18 1 1
0 18 1 1
0 18 1 1
0 18 1 1
0 18 0 1
0 18 0 1
0 18 0 1
0 18 0 1
0 18 0 1
0 18 0 1
0 18 0 1
1 18 0 3
0 18 0 1
. 18 0 1
end
label values cconfINsuf cconfINsuf
label def cconfINsuf 0 "Enough Food", modify
label def cconfINsuf 1 "Not Always Enough Food", modify
label values rrace2 rrace2
label def rrace2 1 "White", modify
label def rrace2 2 "Black", modify
label def rrace2 3 "Hispanic", modify
Code for the weighting procedure and melogit regression:

svyset [iw=rpweight], brrweight(r2pweight*) fay(0.5) vce(brr) mse


svy brr, subpop(rsnap): melogit cconfINsuf i.rrace2 i.income i.rtenure i.ranywork i.gendA i.reduc i.rms i.rthhld_num i.rthhld_numkid i.reweek##i.r2state || lev2: state

RESULTS:

melogit is not supported by svy with vce(brr); see help svy estimation for a list of Stata estimation commands that are supported by svy
r(322);


The help svy estimation page recommended in the error message lists melogit among the svy-capable estimation commands, yet using melogit with this survey design still produces the error.
It seems that, in Stata, multilevel modeling cannot accommodate a complex survey design that uses replicate weights.
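For what it's worth, vce(brr) supports only a restricted set of estimators, and melogit is svy-capable only with the default linearized variance estimation. Below is a hedged sketch of two alternatives; psu and strata are hypothetical design variables (not in the original post), and the covariate list is abbreviated:

Code:
* Option 1: linearized VCE, if PSU/stratum identifiers were available
svyset psu [pw=rpweight], strata(strata)
svy, subpop(rsnap): melogit cconfINsuf i.rrace2 || rstate:

* Option 2: probability weights passed to melogit directly,
* accepting model-based rather than replicate-based standard errors
melogit cconfINsuf i.rrace2 [pw=rpweight] || rstate:

Since the Household Pulse Survey releases only replicate weights, Option 1 may not be feasible here; Option 2 fits the multilevel model but gives up the BRR variance estimation.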


