Hi all,
I'm experiencing a curious problem while trying to loop through a particularly large dataset. I'm trying to compress and clean the data a million observations at a time, to ensure I don't go above my computer's memory capacity.
My first iteration (when i = 1 and interval_start, interval_end are 1 and 1000000, respectively) works fine, but when the loop starts again I get an error stating "using required". Why does it work the first time but not the second time? I know the first iteration successfully completes its second compress and saves the first dataset, because this is the output I get:
" variable concepts10 was str72 now str69
variable concepts11 was str73 now str72
variable concepts15 was str108 now str72
variable concepts17 was str108 now str72
variable concepts20 was str74 now str72
variable concepts23 was str85 now str72
variable concepts24 was str85 now str75
variable concepts25 was str73 now str72
variable concepts26 was str43 now str41
variable concepts27 was str36 now str34
variable concepts28 was str19 now str1
(82,895,562 bytes saved)
file OpenAlex_pull_p1.dta saved
1000001
2000000
2
using required "
The code is included below, as well as a visual example of my data. (I'm sorry it's not in a good format - dataex was giving me a "data width (579 chars) exceeds max linesize. Try specifying fewer variables" error. The exact nature of the data is also less material than the nature of the code.)
I'd appreciate any help or advice you could give!
cd $pull_data
describe using OpenAlex_pull
local num_obs = `r(N)'
display `num_obs'
local interval_start = 1
local interval_end = 1000000
local done = 0
local i = 1
while `done' != 1 {
    display `interval_start'
    display `interval_end'
    clear all
    display `i'
    use in `interval_start'/`interval_end' using OpenAlex_pull, clear
    capture drop multiple_concepts
    local interval_start `interval_end' + 1
    display `interval_start'
    local interval_end `interval_end' + 1000000
    display `interval_end'
    compress
    split concepts, p(",")
    des, short
    local n_vars `r(k)'
    local n_concept_vars = `n_vars' - 12
    gen keep = 0
    forvalues j = 1/`n_concept_vars' {
        display `j'
        // Identify those obs which have a Physics rating of 90-99% or 100%, respectively
        replace keep = 1 if substr(concepts`j', 1, 7) == "Physics" & (substr(concepts`j', -4, 1) == "9" | substr(concepts`j', -5, 1) == "1")
    }
    drop if keep != 1
    cd $pull_data
    compress
    save OpenAlex_pull_p`i', replace
    local i `i' + 1
    if `interval_end' > `num_obs' {
        local done = 1
    }
}
search_name concepts
A JORISSEN Physics/0/94.4,Astronomy/1/87.9,Astrophysics/1/87.5,Computer science/0/85.2,Computer vision/1/77.9,Stars/2/77.7,Quantum mechanics/1/65.6,Mathematics/0/38.5,Spectral line/2/31.7,Galaxy/2/24.4,Biology/0/22.5,Binary number/2/21.9,Arithmetic/1/21.9,Chemistry/0/20.8,Geography/0/20.2,
A JORISSEN Art/0/29.8,Philosophy/0/26.3,History/0/21.1,Physics/0/21.1,
A JORISSEN Physics/0/100.0,Astronomy/1/50.0,Geometry/1/50.0,Combinatorial chemistry/1/50.0,Theology/1/50.0,Mathematics/0/50.0,Biochemistry/1/50.0,Quantum mechanics/1/50.0,Stereochemistry/1/50.0,Biology/0/50.0,History/0/50.0,Thermodynamics/1/50.0,Mathematical analysis/1/50.0,Philosophy/0/50.0,Art/0/50.0,Medicinal chemistry/1/50.0,Multiplicity (mathematics)/2/50.0,Catalysis/2/50.0,Archaeology/1/50.0,Component (thermodynamics)/2/50.0,Dipole/2/50.0,Organic chemistry/1/50.0,Chemistry/0/50.0,Geography/0/50.0,Treasure/2/50.0,
A JORISSEN Astronomy/1/100.0,Computer vision/1/100.0,Computer science/0/100.0,Astrophysics/1/100.0,Quantum mechanics/1/100.0,Physics/0/100.0,Supernova/2/100.0,Stars/2/100.0,Asymptotic giant branch/3/100.0,Nucleosynthesis/3/50.0,Spectral line/2/50.0,Orbital period/3/50.0,Mathematics/0/50.0,Binary number/2/50.0,Giant star/3/50.0,Arithmetic/1/50.0,Galaxy/2/50.0,Metallicity/3/50.0,Stellar evolution/3/50.0,s-process/4/50.0,Binary system/3/50.0,Atomic physics/1/50.0,Nuclear physics/1/50.0,Neutron star/2/50.0,White dwarf/3/50.0,
A JORISSEN Physics/0/91.7,Mechanics/1/83.3,Engineering/0/83.3,Mathematics/0/75.0,Mechanical engineering/1/66.7,Geometry/1/58.3,Geology/0/58.3,Materials science/0/58.3,Flow (mathematics)/2/50.0,Thermodynamics/1/50.0,Geomorphology/1/50.0,Aerospace engineering/1/50.0,Venturi effect/3/41.7,Computer science/0/41.7,Nozzle/2/41.7,Oceanography/1/41.7,Discharge coefficient/3/41.7,Inlet/2/41.7,Biology/0/33.3,Meteorology/1/33.3,Composite material/1/33.3,Economics/0/33.3,Reynolds number/3/33.3,Turbulence/2/33.3,Geography/0/33.3,
A JORISSEN Computer science/0/100.0,Astronomy/1/50.0,Information retrieval/1/50.0,Computer vision/1/50.0,Mathematics/0/50.0,Environmental science/0/50.0,Astrophysics/1/50.0,Quantum mechanics/1/50.0,Galaxy/2/50.0,Statistics/1/50.0,Physics/0/50.0,Mathematical analysis/1/50.0,Stars/2/50.0,Survey data collection/2/50.0,Milky Way/3/50.0,Content (measure theory)/2/50.0,
A JORISSEN Computer science/0/100.0,Quantum mechanics/1/100.0,Physics/0/100.0,Astronomy/1/50.0,Remote sensing/1/50.0,Thermodynamics/1/50.0,Optics/1/50.0,Geology/0/50.0,Interferometry/2/50.0,Component (thermodynamics)/2/50.0,Geography/0/50.0,
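Returning to the loop above, here is a minimal sketch of one possible explanation, assuming the difference between the first and second pass comes from how the locals are updated. A local assigned without = stores the text of the expression rather than its value. display evaluates that text, so the printed numbers look right, but a command such as use in would receive the unevaluated text, which may be why Stata stops with "using required".

```stata
* Without "=", the local holds the text of the expression
local interval_end 1000000
local interval_start `interval_end' + 1
display "`interval_start'"    // shows the stored text: 1000000 + 1
display `interval_start'      // display evaluates it: 1000001

* With "=", the expression is evaluated before it is stored
local interval_start = `interval_end' + 1
display "`interval_start'"    // shows 1000001
```

Under the same assumption, the chunking could be sketched with forvalues. The names chunk, first and last are hypothetical and not part of the original code, and the cleaning steps are left as a placeholder. Capping the upper bound with min() is meant to keep the final, shorter chunk within the number of observations in the file.

```stata
describe using OpenAlex_pull
local num_obs = r(N)
local chunk = 1000000          // hypothetical name for the chunk size
local i = 1
forvalues first = 1(`chunk')`num_obs' {
    * Cap the upper bound so the last chunk does not run past the file
    local last = min(`first' + `chunk' - 1, `num_obs')
    use in `first'/`last' using OpenAlex_pull, clear
    * ... cleaning steps from the loop above go here ...
    compress
    save OpenAlex_pull_p`i', replace
    local ++i
}
```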