Parallelizing remote tar and getting output by ssh pipe

Question

I've tried to parallelize script I'm using, but so far GNU Parallel is very challenging.

I've got 2 files - one containing hosts on which to run command and second having params for command. Below is sample data:

$ cat workers.host [email protected] [email protected] [email protected] [email protected] $ cat paths /usr/local/jar/x/y/ jarxy /usr/local/jar/z/y/ jarzy /usr/local/jar/y/y/ jaryy /usr/local/far/x/y/ farxy /usr/local/jaz/z/z/ jazzz /usr/local/mu/txt/ana/ acc01 /usr/local/jbr/x/y/ accxy

And to process that, I use following script:

#!/bin/bash echo "Run this on 192.168.130.10"; DATA=`date +%F` DDAY=`date +%u` DOMBAC='nice tar cpzf' readarray -t hosts < workers.host len=${#hosts[@]}; processed=0; while read -r -a line; do let hostnum=processed%len; ssh ${hosts[$hostnum]} -i /root/.ssh/id_rsa "$DOMBAC - ${line[0]}" > "/data/backup/$DDAY/${line[1]}_${DATA}_FULL.tgz" let processed+=1; done < paths

This works well, however processes step-by-step on machine after machine. Hosts are quite overpowered and network isn't a problem here, so I wanted to parallelize this as much as possible. For example run 4 instances of tar command on each host and pipe output through ssh into properly named file. I am completely lost with parallel --results --sshloginfile... And what I ultimately try to accomplish is to have 4 jobs running on each host, each and every one with different params (so that for example host 2 doesn't overwrite what host 1 already did). Can this be done in GNU Parallel?

Ole Tange · Accepted Answer · 2017-01-24 08:15:59Z

First you need to know how to extract multiple arguments from a single line of input:

cat paths | parallel --colsep ' ' echo {2} == {1}

(I note that some of the {2} can be generated from {1}. If that is always the case, you might want to look into {= =}; but that is a story for another question).

To run jobs remotely you use -S:

cat paths | parallel --colsep ' ' -S server echo {2} == {1}

To control how many jobs are run on a remote server use -j:

cat paths | parallel --colsep ' ' -j4 -S server echo {2} == {1}

To get the output stored locally --return --cleanup can be used:

cat paths | parallel --return {2} --cleanup --colsep ' ' -j4 -S server echo {2} == {1} '>' {2}

As you want the output stored in a different dir locally, you will need to use the /./ trick:

cat paths | parallel --return my/local/dir/./{2} --cleanup --colsep ' ' -j4 -S server echo {2} == {1} '>' {2}

To use a special ssh command use --ssh:

parallel --ssh "ssh -i $HOME/.ssh/id_rsa.pub -lroot" --return {2} --cleanup --colsep ' ' -j4 -S server echo {2} == {1} '>' {2}

To run on multiple servers, use --slf instead of -S:

parallel --slf hosts.txt --ssh "ssh -i $HOME/.ssh/id_rsa.pub -lroot" --return {2} --cleanup --colsep ' ' -j4 echo {2} == {1} '>' {2}

So in total your command could look like this:

parallel --slf hosts.txt --ssh "ssh -i $HOME/.ssh/id_rsa.pub -lroot" --return /data/backup/$DDAY/./{2}_${DATA}_FULL.tgz --cleanup --colsep ' ' -j4 "$DOMBAC - {1} > {2}_${DATA}_FULL.tgz"

Stack Overflow comments do not lend itself towards code sharing and discussion ;) Anyway - I need to try Your solution. My current (crude one) relies on two things: parallel -j${JOBS} --link ./do_on_hosts.sh {1} {2} "$DATA" "$DDAY" :::: workers.host :::: to_backup.path and in do_on_hosts.sh: ssh $1 "${REMOTE_CMD} ${arr[0]}"> ${LOCAL_STORAGE_PATH}${DDAY}/${arr[1]}_${DATA}_FULL.tgz I wonder how that compares to Yours... need another round to try it. — Johnny_Bit
– Johnny_Bit, Commented Jan 28, 2017 at 20:08

Argonauts · Accepted Answer · 2017-01-22 18:24:23Z

You could use parallel to accomplish this, but I think it would be overkill for what you are trying to accomplish. Instead I would simply use background jobs to run these commands (close to) simultaneously.

To accomplish this with a minimum of changes to your existing script, all we need to do is background each task as it run (using the & operator to do this). In order to prevent orphaned processes, we should make sure that the script doesn't exit until all jobs have finished, which is accomplished with the bash builtin wait. The jobs command will output a list of running tasks (may not be all of them, some may have already finished before you hit that spot depending on execution time).

I'm also not sure on why you are using the nice command without an argument - I believe that without an argument all it will do is print the relative priority of the launched task, which I suppose could be your intent.

Here is a modified version of your script with these changes

#!/bin/bash echo "Run this on 192.168.130.10"; DATA=`date +%F` DDAY=`date +%u` DOMBAC='nice tar cpzf' readarray -t hosts < workers.host len=${#hosts[@]}; processed=0; while read -r -a line; do let hostnum=processed%len; ssh ${hosts[$hostnum]} -i /root/.ssh/id_rsa "$DOMBAC - ${line[0]}" > "/data/backup/$DDAY/${line[1]}_${DATA}_FULL.tgz" & let processed+=1; done < paths jobs wait

Hi! I've managed to parallelize this using gnu parallel, however I need to wait to ensure my solution works properly. In comment to proposed solution - well, that's actually why I need GNU Parallel - otherwise worker hosts would be spammed with tasks and would be unable to finish. Imagine that hosts file has more like 4-8 entries, but paths has 10-20 thousands paths. As for nice - without args it changes niceness from 0 to 10 :) — Johnny_Bit
– Johnny_Bit, Commented Jan 23, 2017 at 20:25
I was looking at the man page for nice man7.org/linux/man-pages/man1/nice.1.html — Argonauts
– Argonauts, Commented Jan 24, 2017 at 0:47
Post Your solution using parallel if you can so that if anyone else searches for this they find it — Argonauts
– Argonauts, Commented Jan 24, 2017 at 0:49

Stack Exchange Network

Parallelizing remote tar and getting output by ssh pipe

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Parallelizing remote tar and getting output by ssh pipe

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions