Bash process substitution and the tee command
If you're familiar with the shell, you've probably come across the
concept of "piping" commands into each other. A frequently used
example is looking for a process by piping the output of ps into
grep, with something like ps aux | grep process_name, where
process_name stands for whatever you are searching for.
This pattern is exceptionally useful and comes up in numerous
situations when working in the shell. However, sometimes you run into
scenarios where you really need multiple pipelines at once or need to
combine the output from several different commands. One option is
always to redirect the output from one of the pipelines into a
temporary file, but it can be nice to avoid this when possible. This
is where process substitution and the tee command can come into play.
Process substitution
As the name indicates, process substitution allows you to substitute
the output of one process for a file argument of another command.
The syntax is very similar to normal redirection of stdin and stdout,
but with the addition of parentheses: <(command) stands in for a file
to read from, and >(command) for a file to write to. One of my frequent
uses for process substitution is combining columns of output from
different commands with the paste command. If you don't know,
the paste command "merges lines of files" according to its man page.
For example, if I have file a:
2
3
4
5
and a similar file b:
7
8
9
10
I can combine them together with paste a b to get
2 7
3 8
4 9
5 10
This is obviously a useful command on its own, but what if I only want to paste together a certain column from a multi-column file? Or what if I want to use AWK to modify the columns in some way before pasting them together? This is where process substitution comes in.
Let's stick with the second of these situations for now, and assume we want to multiply the values in file a by 5 and those in file b by 2. One way of doing this is a command along the lines of paste <(awk '{print $1 * 5}' a) <(awk '{print $1 * 2}' b), which results in the output below.
10 14
15 16
20 18
25 20
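As a minimal, self-contained sketch of that idea: the helper name scaled_paste and the temporary directory are additions for illustration only, and the script assumes bash, since sh may not understand the <( ) syntax.

```shell
#!/usr/bin/env bash
# Sketch: scale two single-column files with AWK and paste them side by side.
# scaled_paste is a hypothetical helper; the files a and b mirror the post.

scaled_paste() {
    # Each <( ... ) is handed to paste as if it were a readable file name.
    paste <(awk '{ print $1 * 5 }' "$1") <(awk '{ print $1 * 2 }' "$2")
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 2 3 4 5  > "$workdir/a"
printf '%s\n' 7 8 9 10 > "$workdir/b"

scaled_paste "$workdir/a" "$workdir/b"
```

No temporary files are needed for the intermediate AWK results; only the two input files exist on disk. Note that paste separates the columns with a tab.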
One thing to make explicit is that process substitution is a feature
of bash, not of shells in general, at least with the syntax I have
described here; it is not part of POSIX sh. I can only speak for the
three shells I tried (bash, fish, and sh), but of those only bash
accepted the syntax described above. The main time I have run into
trouble with this is when writing a shell script with #!/bin/sh as
my shebang line, but if you use a shell other than bash as your
default shell you will likely have trouble as well.
tee
So that's process substitution, but what's this tee thing? According
to its man page, tee allows you to
"read from standard input and write to standard output and files."
The really useful part of the command is that it does both at once.
Whereas normal pipelines have one way in and one way out, the aptly named
tee acts just like a tee joint in a physical pipeline,
allowing you to redirect information flow in multiple directions. A
basic use in a variant of our first example command would log all
running processes, while also continuing on to look for a certain one,
with something like ps aux | tee ps.log | grep process_name.
This writes the file ps.log with the full output from ps
while also piping that full output on to grep, which will
direct its output to stdout as usual.
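To make the behavior easy to try without depending on what ps prints on a given machine, here is a small sketch that feeds tee a made-up listing instead. The function name log_and_filter and the sample lines are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep a full copy of a listing in a log file while filtering it.
# log_and_filter and the sample listing are hypothetical stand-ins for
# a pipeline of the form: ps aux | tee ps.log | grep process_name

log_and_filter() {
    local logfile=$1 pattern=$2
    # tee writes everything it reads to "$logfile" and also passes it on.
    tee "$logfile" | grep -- "$pattern"
}

logdir=$(mktemp -d)
trap 'rm -rf "$logdir"' EXIT

printf '%s\n' '101 bash' '102 vim' '103 awk' |
    log_and_filter "$logdir/ps.log" vim
```

After this runs, the log file holds all three lines while only the matching line reaches stdout.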
Combining the two
What prompted this post was working on a script to summarize the output
from the qstat command on one of our clusters, which basically just
reports information about the running jobs. By default it returns information
about all users, which you can filter down to only your username with the
-u flag, but since I'm searching for stuff anyway, and
I'm in the middle of reading The AWK Programming Language,
I went ahead and searched for my username with the help of AWK.
qstat | awk 'BEGIN {q = r = e = 0} FNR == 1 {print} \
(/USER/ && / [QRE] /) {if (/Q/) q++; if (/R/) r++; \
if (/E/) e++; print} END \
{printf "%d in queue\n%d running\n%d exiting\n", q, r, e}' \
| tee >( head -n -3 ) >( tail -n 3 ) > /dev/null
The AWK part of this is not that interesting, or is at least the topic of a
separate post, so let us focus on the tee part.
As we saw above, process substitution can be used to substitute the
output from a command for a file used as the input for
another command. However, you can also substitute a process for a
file being used to catch the output from a command. That is
what is happening here. tee writes what was piped into
it to the "files" >( head -n -3 ) and >( tail -n 3 ),
which print all but the last three lines and the last three lines of
the pipeline, respectively. tee's own copy on stdout is discarded by
the final > /dev/null.
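The same pattern can be tried on a few made-up lines. In this sketch the name split_report and the sample input are hypothetical, and head -n -3 (a negative line count) assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: send one stream to two consumers via output process substitution.
# split_report is a hypothetical name; head -n -3 assumes GNU head.

split_report() {
    # tee treats each >( ... ) as a file to write to; its own stdout is dropped.
    tee >(head -n -3) >(tail -n 3) > /dev/null
}

# head and tail run concurrently, so which block appears first is not guaranteed.
# Capturing the output also waits for both of them to finish writing.
report=$(printf '%s\n' header job1 job2 'sum A' 'sum B' 'sum C' | split_report)
printf '%s\n' "$report"
```

Every input line shows up exactly once: the first three from head and the last three from tail.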
To make this a little more clear, the output of the AWK command on its own looks something like
Job id Name User Time Use S Queue
12345 jobName brent 00:00 R Q1
0 in queue
0 running
0 exiting
because counting up how many jobs are in queue, running, and exiting
cannot finish until all of the input lines have been read, and by
then there is no way to write it to the top of the output with AWK*.
My earlier version of this script embarrassingly ran
multiple qstat commands and gathered the output into
variables that were then printed at the end. While this version is
not formatted very well and looks kind of rough, it runs three times
faster than that one, while also providing more information.
The output after the tee line is
0 in queue
0 running
0 exiting
Job id Name User Time Use S Queue
12345 jobName brent 00:00 R Q1
which may look like a small change, but is actually exactly what I was looking for,
since I often have many hundreds or thousands of jobs in the queue and only want to see
the oldest ones, which are actually running and end up at the top. Now when I pipe the
output of the script into head I can see the summary, as well as the jobs
most likely to finish soon.
The last part of the tee command is
probably embarrassing for me. My understanding of tee
made me think I could substitute just one of the head
or tail commands and pipe into the other like
normal, along the lines of qstat | awk '...' | tee >( head -n -3 ) | tail -n 3.
Unfortunately, I have not been able to get the same output this way, and that also leaves me with some weird behavior that I managed to avoid by directing the "rest" to /dev/null. So if you know what I am doing wrong, feel free to shoot me an email!
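One likely explanation, offered as a hedged sketch rather than a diagnosis of the exact command: a process substitution started inside a pipeline inherits that pipeline's stdout, so the output of the substituted command flows into the next stage together with tee's own copy. The example below uses sed in place of head so the two copies are easy to tell apart. The workaround of pointing the substituted command at stderr is an assumption for illustration, not something from the original script.

```shell
#!/usr/bin/env bash
# Sketch: output from >( ... ) inside a pipeline lands in the next stage too.

# sort receives one line from tee itself and one from the substituted sed.
both=$(printf 'a\n' | tee >(sed 's/^/copy: /') | LC_ALL=C sort)
printf '%s\n' "$both"

# Possible workaround (an assumption): send the substituted command's output
# to stderr, so that only tee's own copy travels down the pipe to tail.
only_tail=$(printf '%s\n' 1 2 3 4 5 | tee >(head -n -3 >&2) | tail -n 3)
printf '%s\n' "$only_tail"
```

In the first command both lines come out of sort, which would explain why piping into tail mixes the output of head back into what tail sees.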
Footnotes
*As I now know, you could save the output to an array in AWK and loop through the array to print the lines at the end.
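A minimal sketch of that array approach, under a few assumptions: summary_first and the sample job listing are hypothetical, USER stands in for a username as in the command above, and the state tests are written as / Q /, / R /, and / E / (with surrounding spaces) so that a queue name such as Q1 is not counted as a queued job.

```shell
#!/usr/bin/env bash
# Sketch: buffer matching lines in an AWK array, then print the summary first.
# summary_first and the sample listing are hypothetical.

summary_first() {
    awk '
        FNR == 1 { header = $0; next }          # remember the header line
        /USER/ && / [QRE] / {
            if (/ Q /) q++
            if (/ R /) r++
            if (/ E /) e++
            lines[++n] = $0                      # save the job line for later
        }
        END {
            printf "%d in queue\n%d running\n%d exiting\n", q, r, e
            print header
            for (i = 1; i <= n; i++) print lines[i]
        }'
}

printf '%s\n' \
    'Job id Name User Time Use S Queue' \
    '12345 jobA USER 00:00 R Q1' \
    '12346 jobB USER 00:00 Q Q1' |
    summary_first
```

This keeps everything inside a single AWK process, at the cost of holding all matching lines in memory until the input ends.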