
I have a ~40GB file, and a filter command that for some reason breaks when I try to run it on the whole file (even when the file is passed via a pipe).

But it doesn't fail when I split the input file into many small files, pass each of them through the filter, and concatenate the outputs.

So, I'm looking for a way to do:

  • split the file into small blocks (10MB?)
  • run some command on each block
  • concatenate the outputs in the correct order

but without first splitting the file completely (I don't want to use that much disk space).

I can write such a program myself (a rough sketch of what I have in mind is below), but perhaps there is already something that would do what I need?
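
Something along these lines, say, an untested sketch using dd to pull one block at a time so the pieces never hit the disk. Here yourfilter is a placeholder for the real filter command, GNU stat is assumed for the size lookup, and note that a block boundary can fall in the middle of a line or a multibyte character:

    #!/bin/bash
    # Untested sketch: feed $1 to yourfilter in 10MB blocks,
    # without writing the pieces to disk first.
    FILE=$1
    BS=$((10 * 1024 * 1024))            # block size: 10MB
    SIZE=`stat -c %s "$FILE"`           # file size in bytes (GNU stat)
    BLOCKS=$(( (SIZE + BS - 1) / BS ))  # number of blocks, rounded up
    i=0
    while [ $i -lt $BLOCKS ]; do
        # dd reads exactly one BS-sized block, starting at offset $i * BS
        dd if="$FILE" bs=$BS skip=$i count=1 2>/dev/null | yourfilter
        i=$(( i + 1 ))
    done

Run as, say, ./chunked.sh bigfile > bigfile.out; the per-block outputs land on stdout in block order, so no reordering step is needed.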

  • Have you considered posting your current filter command? Maybe someone has a better solution instead of splitting the input file. Commented Aug 6, 2009 at 14:12
  • Arjan: sure, it's iconv -c -f utf8 -t utf8. It bails out on the 40+GB file, but works great on the same file split into parts. Not sure how that's relevant, but hey, it's no secret :) Commented Aug 6, 2009 at 15:42
  • Is your version of iconv large-file aware? See serverfault.com/questions/24803/…; it may be a related problem. Commented Aug 6, 2009 at 19:40
  • @romandas: might not be, but I'm not in a position to change iconv or the system. Commented Aug 6, 2009 at 19:43

6 Answers


You are not the first person to run into this problem with iconv. Someone has written a Perl script to solve it.

iconv doesn't handle large files well. From the glibc source code, in iconv/iconv_prog.c:

/* Since we have to deal with arbitrary encodings we must read the whole text in a buffer and process it in one step. */ 

However, for your particular case, it might be better to write your own UTF-8 validator. You could easily distill iconv -c -f utf8 -t utf8 down to a small C program, with a loop that calls iconv(3). Since UTF-8 is modeless and self-synchronizing, you can process it in chunks.

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BUFSIZE 4096

    /* Copy STDIN to STDOUT, omitting invalid UTF-8 sequences */
    int main() {
        char ib[BUFSIZE], ob[BUFSIZE], *ibp, *obp;
        ssize_t bytes_read;
        size_t iblen = 0, oblen;
        unsigned long long total;
        iconv_t cd;

        if ((iconv_t)-1 == (cd = iconv_open("utf8", "utf8"))) {
            perror("iconv_open");
            return 2;
        }
        for (total = 0;
             bytes_read = read(STDIN_FILENO, ib + iblen, sizeof(ib) - iblen);
             total += bytes_read - iblen) {
            if (-1 == bytes_read) {
                /* Handle read error */
                perror("read");
                return 1;
            }
            ibp = ib;
            iblen += bytes_read;
            obp = ob;
            oblen = sizeof(ob);
            if (-1 == iconv(cd, &ibp, &iblen, &obp, &oblen)) {
                switch (errno) {
                case EILSEQ:
                    /* Invalid input multibyte sequence */
                    fprintf(stderr, "Invalid multibyte sequence at byte %llu\n",
                            1 + total + sizeof(ib) - iblen);
                    ibp++; iblen--; /* Skip the bad byte next time */
                    break;
                case EINVAL:
                    /* Incomplete input multibyte sequence */
                    break;
                default:
                    perror("iconv");
                    return 2;
                }
            }
            write(STDOUT_FILENO, ob, sizeof(ob) - oblen);
            /* There are iblen bytes at the end of ib that follow an invalid
               UTF-8 sequence or are part of an incomplete UTF-8 sequence.
               Move them to the beginning of ib. */
            memmove(ib, ibp, iblen);
        }
        return iconv_close(cd);
    }

If you do decide to write it yourself, and you are talking about text files, you could use Perl with the Tie::File module. It allows you to work on a large file a line at a time, in place. It is meant for just this sort of thing.

If the file is not text, you could try Tie::File::AnyData.


Edit: Just noticed you don't want to split the file in advance because of disk space, so this probably won't work for you.

Use split:

    $ man split
    NAME
           split - split a file into pieces

    SYNOPSIS
           split [OPTION] [INPUT [PREFIX]]

    DESCRIPTION
           Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...;
           default size is 1000 lines, and default PREFIX is `x'.  With no
           INPUT, or when INPUT is -, read standard input.

           Mandatory arguments to long options are mandatory for short
           options too.

           -a, --suffix-length=N
                  use suffixes of length N (default 2)

           -b, --bytes=SIZE
                  put SIZE bytes per output file

           -C, --line-bytes=SIZE
                  put at most SIZE bytes of lines per output file

           -d, --numeric-suffixes
                  use numeric suffixes instead of alphabetic

           -l, --lines=NUMBER
                  put NUMBER lines per output file

           --verbose
                  print a diagnostic to standard error just before each
                  output file is opened

           --help display this help and exit

           --version
                  output version information and exit

           SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB
           1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and
           so on for T, P, E, Z, Y.
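
For example, a sketch using GNU split as documented above, with yourcommand standing in for the real filter. Note that this is exactly the split-everything-to-disk-first approach, so it temporarily needs another ~40GB of space:

    split -a 3 -b 10M bigfile part.   # 40GB / 10MB is ~4096 pieces, more
                                      # than the default 2-letter suffixes
                                      # allow, hence -a 3
    for f in part.*; do               # the suffixes sort alphabetically,
        yourcommand < "$f"            # so a plain glob keeps the order
    done > output
    rm part.*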

I suggest using sed to extract just the parts you want and piping the output into your command:

sed -n '1,1000p' yourfile | yourcommand 

will pipe the first 1000 lines to yourcommand.

sed -n '1001,2000p' yourfile | yourcommand 

will pipe the next 1000 lines.

etc.

You could put this in a loop in a script if you want.

e.g.

    #!/bin/bash
    size=1000
    lines=`cat $1 | wc -l`
    first=1
    last=$size
    while [ $last -lt $lines ] ; do
        sed -n "${first},${last}p" $1 | yourcommand
        first=`expr $last + 1`
        last=`expr $last + $size`
    done
    last=$lines
    sed -n "${first},${last}p" $1 | yourcommand

Try this:

    #!/bin/bash
    FILE=/var/log/messages
    CHUNKSIZE=100
    LINE=1
    TOTAL=`wc -l $FILE | cut -d ' ' -f1`
    while [ $LINE -le $TOTAL ]; do
        let ENDLINE=$LINE+$CHUNKSIZE-1
        # -n so sed prints only the selected range
        sed -n "${LINE},${ENDLINE}p" $FILE | grep -i "mark"
        let LINE=$ENDLINE+1
    done

Well, to everybody suggesting that I write my own solution: I can. And I can even do it without multiple "scans" of the input file. But the problem/question is: is there any ready-made tool?

The simplest Perl-based approach might look like this:

    #!/usr/bin/perl -w
    use strict;
    my ( $lines, $command ) = @ARGV;
    open my $out, '|-', $command;
    my $i = 0;
    while (<STDIN>) {
        $i++;
        if ($i > $lines) {
            close $out;
            open $out, '|-', $command;
            $i = 1;
        }
        print $out $_;
    }
    close $out;
    exit;

and now I can:

    => seq 1 5
    1
    2
    3
    4
    5
    => seq 1 5 | ./run_in_parts.pl 3 tac
    3
    2
    1
    5
    4
