
I have a ~40GB file, and a filter command that for some reason breaks when I try to run it on the whole file (even when the file is passed via a pipe).

But it doesn't fail when I split the input file into many small files, pass each of them through the filter, and concatenate the outputs.

So, I'm looking for a way to do:

  • split the file into small blocks (10MB?)
  • run some command on each block
  • concatenate the outputs in the correct order

but without first splitting the file completely (I don't want to use that much disk space).

I can write such a program myself (a rough sketch of what I have in mind is below), but perhaps there is already something that would do what I need?
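
Something along these lines, say, an untested sketch using dd to pull one block at a time so the pieces never hit the disk. Here yourfilter is a placeholder for the real filter command, GNU stat is assumed for the size lookup, and note that a block boundary can fall in the middle of a line or a multibyte character:

    #!/bin/bash
    # Untested sketch: feed $1 to yourfilter in 10MB blocks,
    # without writing the pieces to disk first.
    FILE=$1
    BS=$((10 * 1024 * 1024))            # block size: 10MB
    SIZE=`stat -c %s "$FILE"`           # file size in bytes (GNU stat)
    BLOCKS=$(( (SIZE + BS - 1) / BS ))  # number of blocks, rounded up
    i=0
    while [ $i -lt $BLOCKS ]; do
        # dd reads exactly one BS-sized block, starting at offset $i * BS
        dd if="$FILE" bs=$BS skip=$i count=1 2>/dev/null | yourfilter
        i=$(( i + 1 ))
    done

Run as, say, ./chunked.sh bigfile > bigfile.out; the per-block outputs land on stdout in block order, so no reordering step is needed.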

  • Have you considered posting your current filter command? Maybe someone has a better solution instead of splitting the input file. Commented Aug 6, 2009 at 14:12
  • Arjan: sure, it's iconv -c -f utf8 -t utf8. It bails out on the 40+GB file, but works great on the same file split into parts. Not sure how that's relevant, but hey, it's no secret :) Commented Aug 6, 2009 at 15:42
  • Is your version of iconv large-file aware? See serverfault.com/questions/24803/…; it may be a related problem. Commented Aug 6, 2009 at 19:40
  • @romandas: might not be, but I'm not in a position to change iconv or the system. Commented Aug 6, 2009 at 19:43

6 Answers


You are not the first person to run into this problem with iconv. Someone has written a Perl script to solve it.

iconv doesn't handle large files well. From the glibc source code, in iconv/iconv_prog.c:

/* Since we have to deal with arbitrary encodings we must read the whole text in a buffer and process it in one step. */ 

However, for your particular case, it might be better to write your own UTF-8 validator. You could easily distill iconv -c -f utf8 -t utf8 down to a small C program, with a loop that calls iconv(3). Since UTF-8 is modeless and self-synchronizing, you can process it in chunks.

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BUFSIZE 4096

    /* Copy STDIN to STDOUT, omitting invalid UTF-8 sequences */
    int main() {
        char ib[BUFSIZE], ob[BUFSIZE], *ibp, *obp;
        ssize_t bytes_read;
        size_t iblen = 0, oblen;
        unsigned long long total;
        iconv_t cd;

        if ((iconv_t)-1 == (cd = iconv_open("utf8", "utf8"))) {
            perror("iconv_open");
            return 2;
        }
        for (total = 0;
             bytes_read = read(STDIN_FILENO, ib + iblen, sizeof(ib) - iblen);
             total += bytes_read - iblen) {
            if (-1 == bytes_read) {
                /* Handle read error */
                perror("read");
                return 1;
            }
            ibp = ib;
            iblen += bytes_read;
            obp = ob;
            oblen = sizeof(ob);
            if (-1 == iconv(cd, &ibp, &iblen, &obp, &oblen)) {
                switch (errno) {
                case EILSEQ:
                    /* Invalid input multibyte sequence */
                    fprintf(stderr, "Invalid multibyte sequence at byte %llu\n",
                            1 + total + sizeof(ib) - iblen);
                    ibp++; iblen--; /* Skip the bad byte next time */
                    break;
                case EINVAL:
                    /* Incomplete input multibyte sequence */
                    break;
                default:
                    perror("iconv");
                    return 2;
                }
            }
            write(STDOUT_FILENO, ob, sizeof(ob) - oblen);
            /* There are iblen bytes at the end of ib that follow an invalid
               UTF-8 sequence or are part of an incomplete UTF-8 sequence.
               Move them to the beginning of ib. */
            memmove(ib, ibp, iblen);
        }
        return iconv_close(cd);
    }

If you do decide to write it yourself, and you are talking about text files, you could use Perl with the Tie::File module. It allows you to work on a large file a line at a time, in place. It is meant for just this sort of thing.

If the file is not text, you could try Tie::File::AnyData.


Edit: Just noticed you don't want to split the file in advance because of disk space, so this probably won't work for you.

Use split:

    $ man split
    NAME
           split - split a file into pieces

    SYNOPSIS
           split [OPTION] [INPUT [PREFIX]]

    DESCRIPTION
           Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...;
           default size is 1000 lines, and default PREFIX is `x'.  With no
           INPUT, or when INPUT is -, read standard input.

           Mandatory arguments to long options are mandatory for short
           options too.

           -a, --suffix-length=N
                  use suffixes of length N (default 2)

           -b, --bytes=SIZE
                  put SIZE bytes per output file

           -C, --line-bytes=SIZE
                  put at most SIZE bytes of lines per output file

           -d, --numeric-suffixes
                  use numeric suffixes instead of alphabetic

           -l, --lines=NUMBER
                  put NUMBER lines per output file

           --verbose
                  print a diagnostic to standard error just before each
                  output file is opened

           --help display this help and exit

           --version
                  output version information and exit

           SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB
           1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and
           so on for T, P, E, Z, Y.
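
For example, a sketch using GNU split as documented above, with yourcommand standing in for the real filter. Note that this is exactly the split-everything-to-disk-first approach, so it temporarily needs another ~40GB of space:

    split -a 3 -b 10M bigfile part.   # 40GB / 10MB is ~4096 pieces, more
                                      # than the default 2-letter suffixes
                                      # allow, hence -a 3
    for f in part.*; do               # the suffixes sort alphabetically,
        yourcommand < "$f"            # so a plain glob keeps the order
    done > output
    rm part.*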

I suggest using sed to extract just the parts you want and piping the output into your command:

sed -n '1,1000p' yourfile | yourcommand 

will pipe the first 1000 lines to yourcommand.

sed -n '1001,2000p' yourfile | yourcommand 

will pipe the next 1000 lines.

etc.

You could put this in a loop in a script if you want.

e.g.

    #!/bin/bash
    size=1000
    lines=`cat $1 | wc -l`
    first=1
    last=$size
    while [ $last -lt $lines ] ; do
        sed -n "${first},${last}p" $1 | yourcommand
        first=`expr $last + 1`
        last=`expr $last + $size`
    done
    last=$lines
    sed -n "${first},${last}p" $1 | yourcommand

Try this:

    #!/bin/bash
    FILE=/var/log/messages
    CHUNKSIZE=100
    LINE=1
    TOTAL=`wc -l $FILE | cut -d ' ' -f1`
    while [ $LINE -le $TOTAL ]; do
        let ENDLINE=$LINE+$CHUNKSIZE-1
        # -n so sed prints only the selected range
        sed -n "${LINE},${ENDLINE}p" $FILE | grep -i "mark"
        let LINE=$ENDLINE+1
    done

Well, to everybody suggesting that I write my own solution: I can. And I can even do it without multiple "scans" of the input file. But the problem/question is: is there any ready-made tool?

The simplest Perl-based approach might look like this:

    #!/usr/bin/perl -w
    use strict;
    my ( $lines, $command ) = @ARGV;
    open my $out, '|-', $command;
    my $i = 0;
    while (<STDIN>) {
        $i++;
        if ($i > $lines) {
            close $out;
            open $out, '|-', $command;
            $i = 1;
        }
        print $out $_;
    }
    close $out;
    exit;

and now I can:

    => seq 1 5
    1
    2
    3
    4
    5
    => seq 1 5 | ./run_in_parts.pl 3 tac
    3
    2
    1
    5
    4
