There are several secure options for transferring files to and from Biowulf and Helix. Detailed setup & usage instructions for each method are below.
No matter how you transfer data in and out of the systems, be aware that PII and PHI data cannot be stored or transferred into the NIH HPC systems.
Globus is a service that makes it easy to move, sync, and share large amounts of data. It is the recommended way to transfer data to and from the HPC systems.
Globus will manage file transfers, monitor performance, retry failures, recover from faults automatically when possible, and report the status of your data transfer. Globus uses GridFTP for more reliable and high-performance file transfer, and will queue file transfers to be performed asynchronously in the background.
Setting up a Globus account, transferring and sharing data
Interactive data transfers should be performed on helix.nih.gov, the designated system for interactive data transfers and large-scale file manipulation. (An interactive session on a Biowulf compute node is also appropriate.) Tasks such as tarring and gzipping a large directory, or rsyncing data to another server, should not be run on the Biowulf login node.
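For instance, bundling results into a compressed archive before transfer is a typical Helix task. A minimal sketch (the directory and file names are illustrative; the first line just creates stand-in data so the example is self-contained):

```shell
# create a small stand-in directory so the sketch is self-contained
mkdir -p results && echo demo > results/sample.txt
# bundle and compress the directory before transferring it
tar -czf results.tar.gz results/
# list the archive contents to verify it was created correctly
tar -tzf results.tar.gz
```

Running the archiving step on Helix (or in an interactive Biowulf session) keeps the CPU load off the login node.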
The HPC System Directories, which include /home, /data, and /scratch, can be mounted to your local workstation if you are on the NIH network or VPN, allowing you to easily drag and drop files between the two places. Note that this is most suitable for transferring small files. Users transferring large amounts of data to and from the HPC systems should continue to use scp/sftp/globus.
Mounting your HPC directories to your local system is particularly useful for viewing HTML reports generated in the course of your analyses on the HPC systems. In these cases, you can navigate to the desired html file and open it in your local system's web browser.
Directions for Locally mounting HPC System Directories
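As a sketch, on a Mac you can mount an HPC share from the command line. The server and share name below (hpcdrive.nih.gov/data) are an assumption; follow the linked directions for the exact path for your account:

```shell
desktop% open 'smb://hpcdrive.nih.gov/data'     # server/share name assumed; see the directions above
```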
Download WinSCP from winscp.net and install it. Administrator privileges may be needed.
To open WinSCP, click the search icon at the bottom left corner of your desktop, type 'winscp', and double-click the result to open it.
Select 'SFTP', enter helix.nih.gov as the host name along with your NIH login username and password, then click 'Login'.
Click 'Yes'. This window only shows up the first time you use WinSCP.
Click 'Continue' in the authentication bar.
The left panel shows the directories on your desktop PC and the right panel shows your directories on Biowulf.
Click on the 'Preferences' icon and browse through the tabs to get an idea of all the options available.
To set the source and destination directories, use the two drop-down boxes. Drag and drop files or folders to start a transfer.
Fugu is a graphical frontend to the commandline Secure File Transfer application (SFTP). SFTP is similar to FTP, but unlike FTP, the entire session is encrypted, meaning no passwords are sent in cleartext form, and is thus much less vulnerable to third-party interception. Fugu allows you to take advantage of SFTP's security without having to sacrifice the ease of use found in a GUI. Fugu also includes support for SCP file transfers, and the ability to create secure tunnels via SSH.
Download Fugu from the U. Mich. Fugu website.
Double-click on the downloaded Fugu_xxxx.dmg file to open it. A small window with the Fugu icon will appear.
Grab the fish and copy it to your Applications folder, your Desktop and/or your Dock.
Start Fugu by clicking on the Fugu icon. In the box for 'Connect to:', enter 'helix.nih.gov' and click 'Connect'. Enter your NIH Login password when requested. You should now see a window with one pane listing files on your local desktop machine, and the other pane listing files in your Biowulf/Helix account space.
Both psftp and pscp are run through the Windows console (Command Prompt in the Start menu), and require that the directory containing the PuTTY executables be included in the Path environment variable. This can be done transiently through the console:
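For example, in the Windows console (the PuTTY install path below is an assumption; adjust it to wherever PuTTY is installed on your system):

```shell
C:\> set PATH=C:\Program Files\PuTTY;%PATH%
```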
or permanently through the System Control Panel (see here for more information).
Secure Copy (pscp) is a command line mechanism for copying files to and from remote systems.
From the console, type 'pscp'. This will bring up a help menu showing all the options for pscp.
PuTTY Secure Copy client
Release 0.58
Usage: pscp [options] [user@]host:source target
       pscp [options] source [source...] [user@]host:target
       pscp [options] -ls [user@]host:filespec
Options:
  -V              print version information and exit
  -pgpfp          print PGP key fingerprints and exit
  -p              preserve file attributes
  -q              quiet, don't show statistics
  -r              copy directories recursively
  -v              show verbose messages
  -load sessname  Load settings from saved session
  -P port         connect to specified port
  -l user         connect with specified user
  -pw passw       login with specified password
  -1 -2           force use of particular SSH protocol version
  -4 -6           force use of IPv4 or IPv6
  -C              enable compression
  -i key          private key file for authentication
  -batch          disable all interactive prompts
  -unsafe         allow server-side wildcards (DANGEROUS)
  -sftp           force use of SFTP protocol
  -scp            force use of SCP protocol
To copy a file from the local Windows machine to a user's home directory on Helix, type
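A sketch of such a command (the file name and target directory are illustrative, and `user` stands for your NIH username):

```shell
C:\> pscp myfile.txt user@helix.nih.gov:/home/user/
```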
You will be prompted for your NIH login password, then the file will be copied.
To do the reverse, i.e. copy a remote file from helix to the local Windows machine, type
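For example (again with illustrative names; note the trailing '.'):

```shell
C:\> pscp user@helix.nih.gov:/home/user/remotefile .
```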
(you must include a '.' to retain the same filename, or explicitly give a name for the remote file's copy).
Secure FTP (psftp) allows for interactive file transfers between machines in the same way as traditional (non-secure) FTP.
From the console, type 'psftp'. This will start an SFTP session, but it will complain that no connection has been made. To transfer a local file to helix, at the psftp prompt type:
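A sketch of a short psftp session (the file name is illustrative):

```shell
C:\> psftp
psftp> open helix.nih.gov
psftp> put myfile.txt
psftp> quit
```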
You will again be prompted for a password.
Once a session to helix has been established, the standard FTP commands can be used.
For even more information, see https://www.chiark.greenend.org.uk/~sgtatham/putty/
scp is a secure, encrypted way to transfer files between machines. It is available on Macs and Unix/Linux machines. Transfers should not be performed on the Biowulf login node, as they will be subject to automatic termination if they use more than a little CPU, memory or walltime. Instead, use Helix for interactive data transfers. Since Helix and Biowulf share the same /home and /data areas, any files you transfer to Helix will also be available on Biowulf in the same path.
To transfer a file from your local machine to the HPC systems (Helix/Biowulf):
desktop% scp myfile $USER@helix.nih.gov:/path/to/desired/dir/
To transfer a file from the HPC systems to your local machine:
desktop% scp $USER@helix.nih.gov:/path/to/desired/dir/file /path/to/local/dir
All of the above methods will avoid use of the Biowulf login node.
If your Helix account is locked due to inactivity, you can unlock it yourself at the Dashboard.
You may want to automatically transfer your generated results back to your local system at the end of a Biowulf batch job.
Command-line transfer as part of a batch job
First you should get familiar with the Globus command-line interface.
Then add something like the following at the end of your Biowulf batch job:
#!/bin/bash
# process your data
..... some batch job commands ....

# now set up a Globus command-line transfer to copy the results back to your local system
globus transfer --recursive \
    e2620047-6d04-11e5-ba46-22000b92c6ec:/data/user/mydir/ \
    d8eb36b6-6d04-11e5-ba46-22000b92c6ec:/data1/myoutput/

The output from the last line of this batch script, which will appear in the usual slurm-#####.out output file, will be a Globus task ID of the form
Task ID: 2fdd385c-bf3e-11e3-b461-22000a971261
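You can later check the status of that transfer with the Globus CLI, e.g. (using the task ID reported for your own transfer):

```shell
helix% globus task show 2fdd385c-bf3e-11e3-b461-22000a971261
```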
helix% module load aws
helix% aws configure      (see here for docs)
### For anonymous download, leave None for the prompts
helix% aws s3 ls s3://mybucket                                         # list contents of bucket
helix% aws s3 sync help
helix% aws s3 sync s3://mybucket /data/$USER/mydir                     (download a full bucket)
helix% aws s3 cp s3://BUCKETNAME/PATH/TO/FOLDER /data/$USER/mydir --recursive   (download a folder in a bucket)
helix% aws --no-sign-request s3 cp s3://BUCKETNAME/PATH/TO/FOLDER /data/$USER/mydir   (download a file anonymously from a bucket)
helix% module load google-cloud-sdk
helix% gcloud init
# you get sent a link that you have to paste in your browser, and then you can select
# your google account that is linked to the bucket
helix% gsutil -m cp gs://my_bucket/* .

Note: the -m flag is for multithreading/multi-processing. The number of threads/processes is set by the flags parallel_thread_count and parallel_process_count in your boto config file. You can find the appropriate config file by typing
helix% gsutil version -l

Recommended values are:
Some sources of biological data have specialized tools for file transfer.
NCBI makes a large amount of data available through the NCBI ftp site, and also provides most or all of the same data on their Aspera server. Aspera is a commercial package that has considerably faster download speeds than ftp. More details in the NCBI Aspera Transfer Guide.
Note that SRA or dbGaP downloads are better done via the SRAtoolkit.
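As a sketch, the SRA toolkit's prefetch command downloads by accession (the accession below is illustrative):

```shell
helix% module load sratoolkit
helix% prefetch SRR000001
```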
You do not need to load any modules. The 'ascp' command is available on Helix by default. If desired, you can set an alias for ascp that includes the key, e.g.
alias ascp="/usr/bin/ascp -i /opt/aspera/asperaweb_id_dsa.openssh"
Sample session (user input in bold):
helix% ascp -T -i /opt/aspera/asperaweb_id_dsa.openssh -l 300M \
    anonftp@ftp-trace.ncbi.nlm.nih.gov:/snp/organisms/human_9606/ASN1_flat/ds_flat_ch1.flat.gz \
    /scratch/$USER
ds_flat_ch1.flat.gz    100% 5523MB  291Mb/s  02:41
Completed: 5656126K bytes transferred in 161 seconds
If your download stops before completion, you can use the -k2 flag to resume transfers without re-downloading all the data. e.g.
helix% ascp -T -i /opt/aspera/asperaweb_id_dsa.openssh -k2 -l500M \
    anonftp@ftp-trace.ncbi.nlm.nih.gov:/snp/organisms/human_9606/ASN1_flat /data/user/
ds_flat_ch1.flat.gz    100%  323MB  0.0 b/s  00:03
[...]
ds_flat_chPAR.flat.gz  100% 7742KB  402 b/s  00:01
ds_flat_chUn.flat.gz   100%   39MB 107Mb/s   00:00
ds_flat_chX.flat.gz    100%  104MB 196Mb/s   00:18
ds_flat_chY.flat.gz    100%   14MB 3.3Mb/s   04:59
Completed: 1706213K bytes transferred in 301 seconds (46432K bits/sec), in 30 files, 1 directory.
Typical file transfer rates from the NCBI server are 400 - 500 Mb/s, so '-l500M' is the recommended value.
By clicking on the icon in the transfer manager window, you can open the Transfer Monitor, which shows a more detailed graph of the transfer rate.
On Helix or Biowulf, use ftp ftp.ncbi.nlm.nih.gov to access the NCBI ftp site. Sample session (user input in bold):
helix% ftp ftp.ncbi.nlm.nih.gov
Connected to ftp.wip.ncbi.nlm.nih.gov.
220- Warning Notice!
[...]
---
Welcome to the NCBI ftp server! The official anonymous access URL is ftp://ftp.ncbi.nih.gov
Public data may be downloaded by logging in as "anonymous" using your E-mail address as a password.
Please see ftp://ftp.ncbi.nih.gov/README.ftp for hints on large file transfers
220 FTP Server ready.
500 AUTH not understood
500 AUTH not understood
KERBEROS_V4 rejected as an authentication type
Name (ftp.ncbi.nlm.nih.gov:user): anonymous
331 Anonymous login ok, send your complete email address as your password.
Password:
230 Anonymous access granted, restrictions apply.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd blast/db/
250 CWD command successful
ftp> get wgs.58.tar.gz
local: wgs.58.tar.gz remote: wgs.58.tar.gz
227 Entering Passive Mode (130,14,29,30,195,228)
150 Opening BINARY mode data connection for wgs.58.tar.gz (983101055 bytes)
226 Transfer complete.
983101055 bytes received in 1.3e+02 seconds (7.7e+03 Kbytes/s)
ftp> quit
221 Goodbye.
helix%
You do not need to load any modules. The 'ascp' command is available on Helix by default. However you need to get the private SSH key file from NCBI. Sample session (user input in bold):
Uploading to SRA:
[NCBI documentation for SRA uploads]
helix% ascp -i /opt/aspera/aspera_tokenauth_id_rsa \
    -QT -l 300m -k1 \
    -d /path/to/directory \
    subasp@upload.ncbi.nlm.nih.gov:uploads/<user@email.com_xxxxx>

where xxxxxx is the string provided by NCBI for this upload.
If your download stops before completion, you can use the -k2 flag to resume transfers without re-downloading all the data.
Uploading to dbGaP:
[NCBI documentation for dbGap uploads] To upload to dbGaP, you need to obtain the upload information from NCBI. Then run a command like:
helix% export ASPERA_SCP_PASS=######-#####-#####-###
helix% ascp -i /opt/aspera/aspera_tokenauth_id_rsa -Q -l 300m -k 1 \
    -d /path/to/directory \
    asp-dbgap@gap-submit.ncbi.nlm.nih.gov:/protected

where the value of ASPERA_SCP_PASS has been provided by NCBI. Note: Do not use the -T flag for dbGaP uploads.
NCBI's Gene Expression Omnibus (GEO) is a public functional genomics data repository. To submit to GEO, you need to register for an account and obtain the GEO FTP credentials (including your account-specific GEO submission directory).
Sample session (user input in bold, comments after ##):
helix% ## Upload a single file
helix% scp sample.fq.gz geoftp@sftp-private.ncbi.nlm.nih.gov:uploads/abc_xyz/
geoftp@sftp-private.ncbi.nlm.nih.gov's password:
sample.fq.gz                100% 2672MB  43.8MB/s   01:01
helix% ## Upload a directory containing GEO submission data
helix% scp -r submission_dir geoftp@sftp-private.ncbi.nlm.nih.gov:uploads/abc_xyz/
helix% ## Upload multiple files with filenames starting with sample1
helix% scp submission_dir/sample1* geoftp@sftp-private.ncbi.nlm.nih.gov:uploads/abc_xyz/
Sample session (user input in bold):
helix% lftp ftp://geoftp@ftp-private.ncbi.nlm.nih.gov
Password:
lftp geoftp@ftp-private.ncbi.nlm.nih.gov:/> cd uploads/abc_xyz
cd ok, cwd=/uploads/abc_xyz
lftp geoftp@ftp-private.ncbi.nlm.nih.gov:/uploads/abc_xyz> mirror -R test_submission_dir
Total: 1 directory, 6 files, 0 symlinks
New: 6 files, 0 symlinks
17228023193 bytes transferred in 87 seconds (188.58M/s)
lftp geoftp@ftp-private.ncbi.nlm.nih.gov:/uploads/abc_xyz> ls
drwxrwsr-x   2 geoftp   geo          4096 Feb  5 13:57 test_submission_dir
lftp geoftp@ftp-private.ncbi.nlm.nih.gov:/uploads/abc_xyz> exit
OpenNeuro.org is a free and open platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data. Data can be uploaded directly from Biowulf using the openneuro command-line tool. It is best to do this on Helix, the designated interactive data transfer node.
helix% module load OpenNeuro_cli
[+] Loading nodejs
[+] Loading OpenNeuro_cli 4.14.3 ...
helix% openneuro login

You will be prompted to choose an OpenNeuro instance (e.g. openneuro.org). You will then be asked to provide an API key.
helix% openneuro upload PATH_TO_BIDS_FOLDER

Use the -i flag to ignore warnings.
Note: if you get errors during the upload, you might want to try an older version of the OpenNeuro CLI, specifically 4.12.1, which has worked for some NIH users.
helix% module load OpenNeuro_cli/4.12.1
helix% NODE_OPTIONS=--no-experimental-fetch openneuro upload PATH_TO_BIDS_FOLDER

Thanks to Lina Teichman, NIMH for testing and providing these commands.
By design, the Biowulf cluster is not connected to the internet. However, files can be transferred to and from the cluster using a Squid proxy server. Click on the link below for more details on how to use the proxy server.
http_proxy
ftp_proxy
RSYNC_PROXY
https_proxy

This includes programs such as wget, curl, lftp, rsync, and git.
[user@cn1875 ~]$ wget http://www.nih.gov
--2015-10-01 12:47:48--  http://www.nih.gov/
Resolving dtn02-e0... 10.1.200.238
Connecting to dtn02-e0|10.1.200.238|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "index.html"
    [ <=> ] 38,836      --.-K/s   in 0.002s
2015-10-01 12:47:48 (18.5 MB/s) - "index.html" saved [38836]
[user@cn1875 ~]$ curl -o nih_homepage.html http://www.nih.gov
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38836    0 38836    0     0  2547k      0 --:--:-- --:--:-- --:--:-- 2917k
[user@cn1875 ~ ]$ lftp ftp.redhat.com
lftp ftp.redhat.com:~> ls
drwxr-xr-x            --  /
drwxr-xr-x            --  ..
lrwxrwxrwx   -  2009-12-19 00:00  pub -> .
drwxr-xr-x   -  2015-03-18 00:00  redhat
lftp ftp.redhat.com:~> exit
[user@cn1875 ~ ]$ rsync mirror.umd.edu::centos/timestamp.txt $HOME/tmp
Note: rsync from the compute nodes with the SSH protocol is not supported. Only the rsync protocol is supported (notice the double colon in the command above). Therefore, the following will not work:
[user@cn1875 ~ ]$ rsync server.nih.gov:~/file.txt $HOME/tmp
[user@cn1875 ~ ]$ git clone https://github.com/ncbi/sra-tools.git
Initialized empty Git repository in /home/user/sra-tools/.git/
remote: Counting objects: 7447, done.
remote: Compressing objects: 100% (137/137), done.
remote: Total 7447 (delta 79), reused 0 (delta 0), pack-reused 7309
Receiving objects: 100% (7447/7447), 15.81 MiB | 5.73 MiB/s, done.
Resolving deltas: 100% (4868/4868), done.
The rate of data transfer is only an issue for data amounts greater than 256MB. For amounts less than this, any application will suffice. To optimize transfer rates for large amounts of data, use less demanding encryption ciphers, such as blowfish or arcfour, and try to transfer the data when the network is less busy (before 10 am and after 6 pm). Also use the most appropriate application based on the table below.
The HPC staff have compared these applications; our results are below. We recommend Globus for most transfers. scp is the default and best option for Linux/Unix machines.
| Platform | Application | Pros | Cons |
|---|---|---|---|
| All platforms | Globus | Best transfer method. Clients available for all platforms, web-based. Notifications sent on completion. | The client (Globus Connect Personal or Globus Connect Server) must be installed on the non-Biowulf endpoint, which may require admin access to that system. (More info) |
| Windows | WinSCP | Much faster transfer rates than PuTTY pscp/psftp. | Cumbersome user interface for changing local and remote directories. |
| Windows | pscp/psftp | Direct command-line control over the process. | Must be run through the command prompt; slowest transfer rates seen. |
| Windows | Mapped network drive | Convenient. | Fairly slow transfer rates, especially for very large files. |
| Macs | scp, sftp | Can be used for scripting & automatic file transfers; fastest transfer rates. | Non-GUI interface. |
| Macs | Fugu | Easy to configure and use. | Slower than the command line. |
| Macs | Mapped network drive | Convenient drag-and-drop. | Fairly slow transfer rates, especially for large files. |
| Linux/Unix | scp, sftp | Same as for Macs. | Same as for Macs. |