12

Possible Duplicate:
Free way to share large files over the Internet?
What are some options for transferring large files without using the Internet?

My wife's lab is doing a project here in the US with collaborators in Singapore. They occasionally need to transfer a large amount of high-dimensional image data (~10GB compressed) across continents. With current technologies, what would be a good solution for this usage scenario?

I can think of a few, but none of them seems ideal:

  • Direct connection over the Internet: the transfer rate is only about 500 KB/s, and there is no tool in place to handle errors and retransmissions.
  • Uploading to a common server or a service such as Dropbox: uploading is painful for the non-US collaborators.
  • Burning discs or copying to hard drives and shipping them by courier: the latency is significant, plus there is the extra work of making a local copy.

Any suggestions?

Update: neither side of the collaboration is made up of tech-savvy users.

Frank
  • Image as in pictures, or image as in a file representing a DVD? – Daniel Beck Dec 02 '11 at 19:48
  • High dimensional images, as generated by microscopes. – Frank Dec 02 '11 at 20:28
  • So it's several very large files? Could you give us more information regarding file count, individual file size, and how many of those change between transfers? Is it all of them, some of them, etc.? – Daniel Beck Dec 02 '11 at 20:30
  • Some DNA sequencers have decided that [FedEx is the fastest way to send their prohibitively large amounts of data around the world](http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html). – joshuahedlund Dec 02 '11 at 22:03
  • Sounds like a job for [Sneakernet](http://en.wikipedia.org/wiki/Sneakernet) or [IPoAC](http://en.wikipedia.org/wiki/IP_over_Avian_Carriers). – Naftuli Kay Dec 02 '11 at 23:01
  • Here's a few to check out if you haven't already: [Transfering large files over internet](http://superuser.com/questions/300539/transfering-large-files-over-internet), [Best method of transferring files over internet?](http://superuser.com/questions/119044/best-method-of-transferring-files-over-internet), [Free way to share large files over the Internet?](http://superuser.com/questions/121995/free-way-to-share-large-files-over-the-internet), [Method for transfering large files for newbies](http://superuser.com/questions/123106/method-for-transfering-large-files-for-newbies) – Ƭᴇcʜιᴇ007 Dec 03 '11 at 17:52
  • This comes up a lot in High energy physics. There was a time when the only cost-effective thing to do was write tapes and air freight them. Those days seem to be gone (for now, sometimes these things cycle) and a variety of internet based solutions are used. – dmckee --- ex-moderator kitten Dec 03 '11 at 20:29

6 Answers

20

I suggest you use rsync. Rsync uses a delta-transfer algorithm, so if your files have only partially changed, or if a previous transfer was terminated abnormally, rsync is smart enough to transfer only what is new or changed.

There are several ports of the original rsync to Windows and other non-Unix systems, both free and non-free; see the rsync Wikipedia article for details.

Rsync over SSH is very widely used and works well. 10 GB is a relatively small amount of data nowadays, and you didn't specify what "occasionally" means. Weekly? Daily? Hourly? At a 500 KB/s transfer rate it will take around 6 hours, which is not really that long. If you need to transfer the data frequently, it is probably better to create a cron job that starts rsync automatically.
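
For example, a one-way sync over SSH might look like the sketch below. The host name, user name, and paths are placeholder assumptions, and the `-z` compression flag is left out because the data is already compressed.

```bash
# Push new/changed image data to the remote lab over SSH.
# --partial keeps partially transferred files so that an interrupted
# run can resume instead of starting from scratch.
rsync -av --partial --progress -e ssh \
      /data/microscope-images/ labuser@lab.example.sg:/incoming/microscope-images/

# Optional crontab entry to start the same sync every night at 02:00
# (assumes SSH key authentication is already set up):
# 0 2 * * * rsync -a --partial /data/microscope-images/ labuser@lab.example.sg:/incoming/microscope-images/
```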

haimg
  • Doesn't `rsync` require its own protocol for deltas, requiring a capable counterpart system on the other end? – Daniel Beck Dec 02 '11 at 19:47
  • @DanielBeck: There is nothing in the docs that says that rsync over SSH cannot use deltacopy... Basically rsync client executes another rsync copy on the server via ssh, so I don't see why it wouldn't work. – haimg Dec 02 '11 at 19:55
  • +1 You have a point there. That leaves the Linux requirement on the server though? – Daniel Beck Dec 02 '11 at 20:01
  • Does `rsync`'s delta-algorithm work when transferring binary compressed data (`.zip` or `.jpg`)? – Aditya Dec 02 '11 at 20:46
  • @DanielBeck: I've added a link to Wikipedia article with several Windows rsync ports. Apparently at least some of them work as a server, including ssh. I've never used any of them though. – haimg Dec 02 '11 at 21:36
  • @Aditya: Yes. rsync's delta algorithm works with binary data too. So, if there are some common sections between the source and the target file, they will be skipped. However, re-compressing usually changes the archive too much, so delta algorithm is not that effective in this case. – haimg Dec 02 '11 at 21:40
  • Rsync is probably the best option in terms of reliability and minimizing the amount of data transferred but getting any of the windows ports to work properly takes a fair bit of technical knowledge in my experience. Last time I tried I gave up and wrote some scripts that used bit torrent to transfer the files automatically instead. – stoj Dec 02 '11 at 22:21
  • @haimg: There's a patch available for gzip to make it rsync-friendly. [Link](http://stackoverflow.com/questions/2191045/rsync-friendly-gzip) – afrazier Dec 03 '11 at 04:49
12

A connection across the Internet can be a viable option, and a program such as BitTorrent is well suited to this purpose: it breaks the files up into logical pieces that are sent over the Internet and reconstructed at the other end.

BitTorrent also gives you automatic error checking and repair of damaged pieces, and if more people need the files, they get the benefit of being able to fetch the data from every source that already has (parts of) the file.

Granted, people see it as a nice way to download films and such, but it does have many perfectly legal uses.

A lot of BitTorrent clients also have built-in trackers, so you don't need a dedicated server to host the files.
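
As a rough illustration only (the client, tracker URL, and paths below are placeholder assumptions, and a client with a built-in tracker or DHT would avoid needing a separate tracker at all), creating and seeding a torrent from the command line could look like this:

```bash
# Create a .torrent file describing the dataset.
mktorrent -a udp://tracker.example.org:6969/announce \
          -o images.torrent /data/microscope-images

# Seed it with any client, e.g. transmission-cli, pointing it at the
# directory that contains the payload; then send the small .torrent
# file to the collaborators.
transmission-cli -w /data images.torrent
```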

Mokubai
  • Thanks for the input. Use of BitTorrent within academic networks may make their administrators nervous. Also, the set up and maintenance of a tracker server may not be that easy for an average computer user. – Frank Dec 02 '11 at 20:35
  • That is a good point, bittorrent is actively prohibited in many corporate and academic networks. With proper administration though you can set up a white list within networks of users or machines that are allowed to use bittorrent, though this would mean very close ties with respective IT departments to work properly. As I mentioned you do not necessarily need to have a dedicated server as it can be built in to many client programs. If it is not a good fit for your situation though then no worries, it just seemed to me to be reasonable considering your requirements. – Mokubai Dec 02 '11 at 20:46
  • If you were using bitorrent, also using a webseed sounds like a clever idea – Journeyman Geek Dec 02 '11 at 23:51
  • (As an example of one of ‘more legal uses’ mentioned in the answer, Facebook [utilizes](http://www.facebook.com/video/video.php?v=10100259101684977) bittorrent to deploy their site, 1GB binary, to thousands of production servers. How unfortunate that a technology is discarded mostly because of one of its uses.) – Anton Strogonoff Dec 03 '11 at 09:05
6

Split the file into chunks of e.g. 50 MB (using e.g. `split`). Compute checksums for all of them (e.g. with `md5sum`). Upload them directly using FTP and an error-tolerant FTP client, such as `lftp` on Linux. Transfer all of the chunks plus a file containing all the checksums.

On the remote site, verify that all of the chunks have the expected checksums, re-upload those that failed, and reassemble them into the original file (e.g. using `cat`).

Swap the location of the server as needed (I posted under the assumption that the destination site provides the server and that you start the transfer locally once the files are ready); your FTP client shouldn't care.
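
A minimal sketch of this workflow, assuming GNU coreutils and `lftp`; the host name, credentials, directories, and file names are placeholders:

```bash
# Sending side: split the archive into 50 MB chunks and record checksums.
split -b 50M images.tar.gz images.part_
md5sum images.part_* > images.md5

# Upload the chunks plus the checksum file; lftp retries on connection
# errors and can resume interrupted uploads.
lftp -u labuser \
     -e "mirror -R --only-newer /data/chunks /incoming/images; quit" \
     ftp.example.sg

# Receiving side: verify, re-upload anything that fails, then reassemble.
md5sum -c images.md5
cat images.part_* > images.tar.gz
```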


I have had similar issues in the past and using an error-tolerant FTP client worked. No bits were ever flipped, just regular connection aborts, so I could skip creating chunks and just upload the file. We still provided a checksum for the complete file, just in case.

Daniel Beck
  • You need to be aware though that `lftp` does not abort a transfer in progress for *any* reason. Make sure that you always have enough free disk space on the destination site. – Daniel Beck Dec 02 '11 at 19:50
3

A variation on Daniel Beck's answer is to split the files into chunks on the order of 50 MB to 200 MB and create parity files for the whole set.

You can then transfer the files (including the parity files) with FTP, SCP, or something else to the remote site and check the whole set after it arrives. If parts are damaged, they can be repaired from the parity files, provided there are enough parity blocks; how much can be recovered depends on how many files are damaged and how many parity files you created.

Parity files are used a lot on Usenet to send large files, where the data is usually also split into RAR archives first. It is not uncommon to send 50 to 60 GB of data this way.

You should definitely check out the first link, and you could also take a look at QuickPar, a tool that can create parity files, verify your downloaded files, and even restore damaged files from the provided parity files.
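
The same idea can be sketched on the command line with par2cmdline, which uses the same PAR2 format as QuickPar; the 10% redundancy figure and file names below are just illustrative assumptions:

```bash
# Create parity files with ~10% redundancy for a set of chunks.
par2 create -r10 images.par2 images.part_*

# After the transfer, on the receiving side:
par2 verify images.par2    # reports missing or damaged blocks
par2 repair images.par2    # reconstructs them if enough parity blocks arrived
```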

Martijn B
  • +1 - This approach works well on usenet, and the parity files can repair an astonishing amount of missing data. Downside being the processing time required to split and generate parity files and to parity check and extract files after receipt. – deizel. Dec 03 '11 at 05:29
1

Is it one big 10GB file? Could it be easily split up?

I haven't played with this much, but it struck me as an interesting and relatively simple concept that might work in this situation:

http://sendoid.com/

Craig H
  • Sendoid is pretty cool, but unfortunately uploading is still going to be painful. Then again, the problem persists for all types I believe, unless you are going to mail a HDD. +1 as it's easy to use. – DMan Dec 03 '11 at 04:08
0

Make the data available via FTP/HTTP/HTTPS/SFTP/FTPS (requiring logon credentials) and use any download manager on the client side.

Download managers are specifically designed to retrieve data despite any errors that may occur, so they are well suited to your task.

As for the server, an FTP server is typically the easiest to set up; you can consult a list of them on Wikipedia. HTTPS, SFTP, and FTPS allow encryption (with plain FTP/HTTP, the password is sent in clear text), but SFTP/FTPS are less commonly supported by client software, and HTTP/HTTPS server setup is trickier.
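
As a client-side example (the URL, credentials, and checksum file are placeholders, and any GUI download manager would do the equivalent), a resumable command-line download might look like this:

```bash
# -c resumes a partially downloaded file; --tries=0 retries indefinitely.
wget -c --tries=0 --retry-connrefused --user=labuser --ask-password \
     ftp://ftp.example.edu/incoming/images.tar.gz

# Compare against a checksum published alongside the file, just in case.
md5sum -c images.tar.gz.md5
```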

ivan_pozdeev
  • The problem with using http or ftp is that if there are any transmission errors, you have to send the whole thing again. rsync, bittorrent, and other protocols can verify that the files match and only retransmit the damaged pieces. Parity data, like QuickPar generates, can help too. – afrazier Dec 03 '11 at 01:00
  • Both FTP and HTTP include a transfer resumption capability as an optional extension which is supported by the majority of servers and virtually all download managers. – ivan_pozdeev Dec 20 '11 at 03:28
  • They *may* resume, and theoretically TCP makes sure that data arrives in order and with a valid checksum. However, anyone who's had a large HTTP or FTP transfer corrupted has learned the value of more robust protocols or some kind of ECC. – afrazier Dec 20 '11 at 03:57