Lickety split

by Robert Daeley

Recently I had an Apache access log file on a remote server that I wanted to archive. However, it was 3GB, and /usr/bin/zip refused to even admit the behemoth's existence.

The first idea that came to mind was splitting the file into smaller chunks that zip could deal with. For some reason, the prospect of an arduous manual process that would take me through Flag Day didn't appeal, so I poked around via apropos to see what was available:

$ apropos split

Lo and behold, at the end of a bunch of other stuff,

split(1) - split a file into pieces

(The server was running OS X 10.3, which as far as I can tell does not include the more direct zipsplit utility found on 10.4. Same basic idea, though.)

I copied the behemoth to a secondary drive (took a while) and then navigated to its directory.

$ ls -l

which let me know:

-rw------- 1 robert staff 4239286441 10 Jun 04:55 behemoth_log

That's a lot of bytes. Since I wanted to get the largest chunk down to a svelte 500MB, I needed this:

$ split -b 500m behemoth_log

Which, after a long period of splitting, produces these:

$ ls -lh

-rw------- 1 robert staff 3G 10 Jun 04:55 behemoth_log
-rw------- 1 robert staff 500M 10 Jun 05:18 xaa
-rw------- 1 robert staff 500M 10 Jun 05:19 xab
-rw------- 1 robert staff 500M 10 Jun 05:20 xac
-rw------- 1 robert staff 500M 10 Jun 05:20 xad
-rw------- 1 robert staff 500M 10 Jun 05:21 xae
-rw------- 1 robert staff 500M 10 Jun 05:22 xaf
-rw------- 1 robert staff 500M 10 Jun 05:22 xag
-rw------- 1 robert staff 500M 10 Jun 05:23 xah
-rw------- 1 robert staff 42M 10 Jun 05:23 xai
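Putting the pieces back together is just a matter of concatenating them in order, and a quick comparison confirms nothing was lost. Here's a sketch using a small stand-in file (since you probably don't have a behemoth_log lying around):

```shell
# Create a small stand-in file and split it into 1KB pieces
# (the real log would use -b 500m instead).
head -c 5000 /dev/urandom > sample_log
split -b 1k sample_log

# The shell expands xa? in lexical order, which matches the order
# split wrote them in, so cat rebuilds the original byte-for-byte.
cat xa? > sample_log_rejoined

# Verify the reassembled copy matches the original.
cmp sample_log sample_log_rejoined && echo "identical"
```

The same `cat xa? > behemoth_log` would rebuild the full log from the chunks above.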

Alternatively, I could have split it by kilobytes, or by number of lines using the -l line_count flag. You can also customize the output file names -- read up on man split for more info. By the way, I'm guessing you'll want to limit your splitting to text files, so leave those binaries alone. See the comments below regarding using split on binaries as well.
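For instance, splitting by lines with a custom name prefix might look like this (the sample file and the log_part_ prefix are just for illustration):

```shell
# Make a small sample file of 100 numbered lines.
seq 1 100 > sample.txt

# Split into 30-line chunks named log_part_aa, log_part_ab, ...
split -l 30 sample.txt log_part_

ls log_part_*
wc -l log_part_ad   # the last chunk holds the 10 leftover lines
```

Line-based splitting is handy for log files specifically, since each chunk stays a valid text file with no entries cut in half.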


2007-06-15 13:47:00
split works on binaries, too. Just cat them together before use.

And 'zip' cmd failed because ZIP archives have a 32-bit size limit per entry.

'gzip' cmd should have worked.

2007-06-16 07:11:34
yeah, is there any reason you chose not to gzip or bzip2? the compression's great and the tools are common:

tar -cvjf behemoth_log.tar.bz2 behemoth_log


bzip2 -c behemoth_log > behemoth_log.bz2

2007-06-16 08:15:34
I [re]discovered split about a month ago for a problem at work. We had some large files to load into a database and the logs were filling up with each attempt to load them. After writing a dozen lines or so of Perl to split them, that nagging feeling that this already exists at the command line kicked in. Now our generic data loader checks the file size and calls split if it's larger than we want to deal with in a single chunk.
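A minimal version of that check-size-then-split logic might look like the sketch below; the function name, chunk naming, and the 100MB threshold are all invented for illustration, not the commenter's actual loader:

```shell
# Hypothetical loader helper: split any input larger than
# MAX_BYTES into MAX_BYTES-sized chunks before loading.
MAX_BYTES=104857600   # 100MB, an arbitrary threshold

maybe_split() {
    file=$1
    size=$(wc -c < "$file")
    if [ "$size" -gt "$MAX_BYTES" ]; then
        # Name the chunks after the original, e.g. data.csv.part_aa
        split -b "$MAX_BYTES" "$file" "${file}.part_"
        echo "split $file into chunks"
    else
        echo "$file is small enough to load whole"
    fi
}
```

A generic data loader could then call maybe_split on each incoming file and feed the resulting chunks to the database one at a time.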
2007-06-16 08:25:07
I've used split to split a 4gb .dmg-file in order to be able to save it on my Windows-formatted iPod (FAT32 has a file size limit of 2gb). Worked fine.
2007-06-16 09:08:31
No real reason not to use gzip -- this is more about split than anything else.
Robert Hook
2007-06-16 16:55:12
I have no idea who wrote "split", but many times I have blessed his/her name. A staggeringly useful tool in the right situation, and the epitome of the Tao of Unix.