Stata
You can process Stata DO files on the cluster. It's easiest to create them in Stata on your workstation and then upload them to the cluster for processing.
When creating your Stata DO files, it's worth breaking the work up into multiple DO files if possible, so that when you come to run them on the cluster you can run several simultaneously. For example, a single DO file that would take 20 hours to process could, if split into 10 DO files run in parallel, finish in around 2 hours. Another benefit is that if the cluster has problems (a node dies or crashes), it won't take long to re-run the affected job(s).
Once you've created your DO files, upload them via SFTP or SCP (see Moving Files) to your home directory. It is highly recommended that you create a new directory for each Stata job/project you plan to run. You then log in to the cluster via ssh/PuTTY (see Getting Started) and submit your Stata DO file to the job queue via a job script.
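For example, a minimal sketch run from your workstation, where hpc.example.ac.uk stands in for your cluster's login node and stata_project1 is a directory name of your own choosing:
# Create a project directory on the cluster, then copy the DO file into it
ssh username@hpc.example.ac.uk "mkdir -p ~/stata_project1"
scp mywork.do username@hpc.example.ac.uk:~/stata_project1/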
Single DO file script
In a text editor, create the following script in the same directory as your DO file. For this example the script's filename is myscript, but you can call it anything you like. Replace JOB_NAME with a name for your job, the --mail-user address with your own email address, and filename_of_do with the name of your DO file (without the .do extension):
#!/bin/bash
#SBATCH --job-name=JOB_NAME
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=email@lshtm.ac.uk
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
stata -b do filename_of_do
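The -b flag runs Stata in batch mode: the DO file is executed non-interactively and the session output is written to a log file named after the DO file (filename_of_do.log) in the working directory.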
So if you had a DO file called mywork.do:
#!/bin/bash
#SBATCH --job-name=JOB_NAME
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=email@lshtm.ac.uk
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
stata -b do mywork
Then to submit your job:
sbatch myscript
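Once the job is submitted, you can check on its progress from the login node with Slurm's squeue command:
squeue -u $USER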
All your output files should be generated in the directory you submitted the job from, including the batch log (mywork.log for the example above).
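While the job is running, you can also follow the log as it is written, for example:
tail -f mywork.log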
Multiple DO files script
If you have split your Stata DO file into multiple files, you will need to give them the same base name with a sequence number appended. For example:
mywork1.do
mywork2.do
mywork3.do
Then, in a text editor, create the following script in the same directory as your DO files. For this example the script's filename is myscript, but you can call it anything you like. This example assumes you used the mywork1.do, mywork2.do… naming convention:
#!/bin/bash
#SBATCH --job-name=ARRAY_TEST_JOB
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=email@lshtm.ac.uk
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --array=1-10
stata -b do mywork${SLURM_ARRAY_TASK_ID}
As it stands, this script will submit 10 array tasks (mywork1.do through mywork10.do), so the range in the --array line should match the number of DO files you have. If you want to run 20, say, change the 10 in the array line to 20:
#SBATCH --array=1-20
Then submit the job:
sbatch myscript
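As before, squeue -u $USER shows the state of each task; array tasks appear with IDs of the form JOBID_TASKID. An individual task can be cancelled without affecting the rest, for example task 3 of job 12345 (the job ID here is just an illustration):
scancel 12345_3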
Single DO file, using SLURM_ARRAY_TASK_ID to run multiple tasks
Submission script:
#!/bin/bash
#SBATCH --job-name=ARRAY_TEST_JOB
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=email@lshtm.ac.uk
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --array=1-10
stata -b do mywork
From within your Stata DO file, you assign the SLURM_ARRAY_TASK_ID environment variable to a local macro:
local x : env SLURM_ARRAY_TASK_ID
You can then use this to choose different datasets to load, or as a random-number seed.
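For example, a minimal sketch of a DO file that uses the task ID both ways; the dataset names (data1.dta, data2.dta, …) are placeholders for illustration:
* Read the array task ID from the environment into a local macro
local x : env SLURM_ARRAY_TASK_ID
* Load a different dataset in each task (data1.dta, data2.dta, ...)
use data`x'.dta, clear
* Or use the task ID to seed the random-number generator so each task differs
set seed `x'
Note that because every task runs the same mywork.do, they will all write to the same mywork.log, so it is worth having each task save its results to task-specific filenames.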