Globus for data transfer
Introduction
Goals: To learn about transferring data to the HPC using Globus.
Learning Objectives: This lesson contributes to the objective to "operate the High-Performance Computing Resources" by "describing data transfer options for the HPC".
Globus
- This lesson focuses on data transfer. The many available tools are listed on the official HPC Data Transfer Overview page.
- You have already uploaded and downloaded small files using OOD, but you need a more capable tool to transfer a lot of data.
- Globus is such a tool:
- It uses GridFTP to optimize data transfer, improving large-file handling, transfer speed, and reliability compared with SCP or SFTP.
- It verifies the correctness of transferred files, restarts dropped connections, and can sync directories (transferring only what has changed).
- It has a graphical user interface and a command-line interface.
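As a rough sketch of the command-line interface, the snippet below builds a Globus CLI transfer command that syncs a directory by checksum, so only files that differ are sent. The collection IDs, paths, and label are hypothetical placeholders, not values from this lesson. The snippet only prints the command rather than running it.

```shell
# Hypothetical placeholders: replace with your own collection IDs and paths.
SRC_ID="SOURCE-COLLECTION-UUID"
DST_ID="DESTINATION-COLLECTION-UUID"
SRC_PATH="/~/fsl_dwi_proc/"
DST_PATH="/~/fsl_dwi_proc/"

# --recursive transfers the whole directory.
# --sync-level checksum skips files whose checksums already match at the destination.
# --label attaches a name to the task so it is easy to find later.
GLOBUS_CMD="globus transfer $SRC_ID:$SRC_PATH $DST_ID:$DST_PATH --recursive --sync-level checksum --label fsl_dwi_proc_sync"

# Print the command for review instead of submitting it.
echo "$GLOBUS_CMD"
```

Once the placeholders are filled in and you have authenticated with globus login, running the printed command should submit the transfer as a background task.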
What is Globus
Globus is used by national labs and computing clusters, including the U of A HPC.
Globus transfers data between endpoints (e.g., the HPC and another computer, or even two HPC directories); the endpoints are displayed as panels in a browser window. You will use Globus in subsequent lessons, and in your future work, to transfer files to and from the HPC.
This 3.5-minute video introduces Globus for transferring files.
Archiving and Data transfer
- If you have lots of files and directories, Globus slows to a crawl. For example, a Globus transfer of my ~100 GB of data (150 subjects in a directory containing BedpostX results) was less than half done a day later!
- Unfortunately, zipping or tarring the datasets takes some time and you’ll have to run such archiving procedures as interactive jobs.
- Here is my experience: Tar alone for the 100 GB directory (no zipping) took 7 minutes to archive and resulted in a 92.75 GB file:
tar -cvf fsl_dwi_proc.tar fsl_dwi_proc
- Tar with zip (z) took 1 hour and 16 minutes to archive, and resulted in a 90.99 GB file:
tar -cvzf fsl_dwi_proc.tar fsl_dwi_proc
- Thus tar+zip saved me only ~2% on file size, but took almost 11 times longer to archive!
- Zip by itself took a long time as well. Given that the NIfTI images are all gzipped anyhow, adding z to the tar command just adds time and complexity to the command without much benefit.
- Using tar without the z is the way to go for our NIfTI data. After the data is archived, the file took only minutes to transfer, instead of days!
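The comparison above can be tried on a small scale. The following is a minimal sketch, assuming a throwaway directory of random data that has already been gzipped (a stand-in for .nii.gz images; the directory and file names are made up for illustration). It archives the directory with and without the z flag and prints both archive sizes.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create three already-gzipped files of random data (hypothetical stand-ins for NIfTI images).
mkdir demo_data
for i in 1 2 3; do
    head -c 200000 /dev/urandom | gzip > "demo_data/sub-$i.nii.gz"
done

# Archive without compression, then with gzip compression (the z flag).
tar -cf demo_data.tar demo_data
tar -czf demo_data.tar.gz demo_data

# Report both archive sizes in bytes.
plain_size=$(wc -c < demo_data.tar)
gz_size=$(wc -c < demo_data.tar.gz)
echo "tar only:   $plain_size bytes"
echo "tar + gzip: $gz_size bytes"
```

Because the contents are already compressed, the two sizes should come out close to each other, which mirrors the small saving reported above.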
This 3.5-minute video explains the importance of archiving your data before transferring it.
Globus Demonstration
Once you install Globus and define bookmarks, you can upload the dataset. You'll learn how to do this in the accompanying practice lesson. Watch this silent 2.25-minute demonstration.
Data Transfer Tips
These tips should help you avoid, or recover from, corrupted data.
- Keep in mind that the HPC will not protect your data: groups and xdisk storage are not backed up, so you use them at your own risk.
- Transfer and check data integrity often, so you have backups if something goes wrong.
- Use cp instead of mv, so you retain the original until you know the new one is okay.
- Use RDAS to store intermediate copies of your files.
- Use Globus to move files between directories on the HPC. You may be able to avoid tarring altogether.
- If you tar neuroimaging data, avoid zipping it, and keep the tarballs relatively small (the larger the file, the more likely it is to be corrupted).
- If you are tarring data, make sure you are in interactive mode. The regular login terminal will not support big operations, so better safe than sorry.
- If a tar file is corrupted, it is virtually impossible to recover, so check that you can extract it before you remove the original data.
In short: Use multiple approaches to keeping your data safe! Do NOT put all your eggs in one basket.
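One way to act on the tip about checking a tar file before removing the original is sketched below. It uses a tiny made-up directory (the names subjects and sub-01 are hypothetical), confirms that tar can list the archive, then extracts it into a scratch directory and compares the result with the original.

```shell
# Work in a temporary directory with a small example dataset.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p subjects/sub-01
echo "example data" > subjects/sub-01/notes.txt

# Archive the directory (no compression).
tar -cf subjects.tar subjects

# Check 1: can tar read the archive from start to finish?
if tar -tf subjects.tar > /dev/null; then
    list_ok="yes"
else
    list_ok="no"
fi

# Check 2: extract into a scratch directory and compare with the original.
mkdir check
tar -xf subjects.tar -C check
if diff -r subjects check/subjects > /dev/null; then
    verify_status="match"
else
    verify_status="MISMATCH"
fi

echo "archive readable: $list_ok; extracted copy vs original: $verify_status"
```

Only when both checks pass would it be reasonable to consider removing the original, and keeping a second copy elsewhere is still the safer choice.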

Resources
Digest
Summary
The University of Arizona pays to provide you with free access to Globus. Globus can efficiently transfer data between your computer and the HPC, or even between directories on the HPC. If you end up using Soteria, Globus also works for that HIPAA-protected storage.
Globus, like all transfer programs, incurs considerable overhead for each separate file transfer, so it is best to archive lots of small files into a single large file before transfer.
In addition, soft links in a directory are NOT preserved by Globus (unless they are protected in an archive before transfer).
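The point about soft links can be illustrated with a short sketch. It assumes a small made-up directory (project, data.txt, and latest.txt are hypothetical names) containing one soft link, archives it with tar, extracts it elsewhere, and checks whether the link is still a link.

```shell
# Work in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a directory containing a real file and a soft link to it.
mkdir project
echo "real file" > project/data.txt
ln -s data.txt project/latest.txt

# Archive the directory, then extract it into a separate location.
tar -cf project.tar project
mkdir restored
tar -xf project.tar -C restored

# Test whether the extracted entry is still a soft link.
if [ -L restored/project/latest.txt ]; then
    link_status="preserved"
else
    link_status="lost"
fi
echo "soft link after extraction: $link_status"
```

Transferring project.tar as a single file and extracting it at the destination should therefore keep the links intact.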
Topics
Globus
Globus bookmark
Globus endpoint
GridFTP
gzip
zip
tar
links and file transfer
transfer overhead for lots of small files
TO DO
Encouraged
In the Discussion: HPC
✅ Propose a topic based on what was difficult or what you would like to know more about.
✅ Did you find a resource (e.g., website, video, article) that was helpful? Share the link and tell us why you liked it.
✅ Read your classmates' posts and provide feedback.