Bash process substitution and the tee command

If you're familiar with the shell, you've probably come across the concept of "piping" commands into each other. A frequently used example of this is looking for a process in the output of ps by running grep on it:

$ ps aux | grep process-name

This pattern is exceptionally useful and comes up in numerous situations when working in the shell. However, sometimes you run into scenarios where you really need multiple pipelines at once or need to combine the output from several different commands. One option is always to redirect the output from one of the pipelines into a temporary file, but it can be nice to avoid this when possible. This is where process substitution and the tee command can come into play.

Process substitution

Much as the name indicates, process substitution lets you use the output of one process wherever a command expects a file name. The syntax is very similar to normal redirection of stdin and stdout, but with the addition of parentheses: <( command ) and >( command ). One of my frequent uses for process substitution is combining columns of output from different commands with the paste command. In case you have not used it, the paste command "merges lines of files" according to its man page. For example, if I have file a:

1
2
3
4
5

and a similar file b:

6
7
8
9
10

I can combine them together with paste a b to get

1 6
2 7
3 8
4 9
5 10

This is obviously a useful command on its own, but what if I only want to paste together a certain column from a multi-column file? Or what if I want to use AWK to modify the columns in some way before pasting them together? This is where process substitution comes in.

Let's stick with the second of these situations for now, and assume we want to multiply the values in file a by 5 and those in file b by 2. One way of doing this is

paste <( awk '{print $1*5}' a ) <( awk '{print $1*2}' b )

which results in the output below.

 5 12
10 14
15 16
20 18
25 20
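The first situation mentioned above, pasting together single columns from multi-column files, works the same way. Here is a minimal sketch, assuming two whitespace-separated files that I made up purely for illustration (scores.txt and ages.txt are hypothetical names, as is the temporary directory they live in):

```shell
#!/usr/bin/env bash
# Hypothetical sample files, created only for this illustration.
workdir=$(mktemp -d)
printf 'alice 90\nbob 85\ncarol 77\n' > "$workdir/scores.txt"
printf 'alice 31\nbob 42\ncarol 27\n' > "$workdir/ages.txt"

# Keep both columns of the first file but only the second column of the
# other one, then let paste join the two streams line by line.
# paste separates its inputs with a tab by default.
merged=$(paste <( awk '{print $1, $2}' "$workdir/scores.txt" ) \
               <( awk '{print $2}' "$workdir/ages.txt" ))
printf '%s\n' "$merged"
```

Neither intermediate column ever has to be written to a file of its own, which is the whole appeal.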

One thing to make explicit is that process substitution is a bash feature rather than something every shell provides, at least with the syntax I have described here. It is not part of the POSIX sh specification. I can only speak for the three shells I tried (bash, fish, and sh), and of those only bash accepted the syntax described above. The main time I have run into trouble with this is when writing a shell script with #!/bin/sh as my shebang line, but if your default shell is something other than bash you may have trouble as well.
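For scripts that really do need to keep #!/bin/sh, one workaround is to build the same plumbing by hand with named pipes. This is only a sketch under a few assumptions: mkfifo is part of POSIX, but mktemp is not (it is just very commonly available), and the files a and b are recreated inside a hypothetical temporary directory so the example is self-contained.

```shell
#!/bin/sh
# Scratch directory holding the sample files and the two named pipes.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$workdir/a"
printf '%s\n' 6 7 8 9 10 > "$workdir/b"
mkfifo "$workdir/col1" "$workdir/col2"

# Each awk runs in the background and blocks until paste opens its pipe.
awk '{print $1*5}' "$workdir/a" > "$workdir/col1" &
awk '{print $1*2}' "$workdir/b" > "$workdir/col2" &

# paste reads the two named pipes as if they were ordinary files.
result=$(paste "$workdir/col1" "$workdir/col2")
printf '%s\n' "$result"

rm -r "$workdir"
```

It is more typing than the bash version and needs explicit cleanup, which is a fair summary of what process substitution saves you.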

tee

So that's process substitution, but what's this tee thing? According to its man page, tee allows you to "read from standard input and write to standard output and files." The "and files" at the end is the really useful part of the command. Whereas normal pipelines have one way in and one way out, the aptly named tee acts just like a tee joint in a physical pipeline, allowing you to send the same information in multiple directions. A basic use, in a variant of our first example command, is to log all running processes while still going on to look for a certain one.

$ ps aux | tee ps.log | grep process-name

This writes the file ps.log with the full output from ps while also piping that full output on to grep, which will direct its output to stdout as usual.
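Since the output of ps differs on every machine, here is the same pattern as a small reproducible sketch. The three input words and the log file name all.log are made up for illustration.

```shell
# Scratch directory so the example does not leave files lying around.
workdir=$(mktemp -d)

# tee saves every line to all.log and also passes every line on to grep.
matches=$(printf '%s\n' alpha beta gamma | tee "$workdir/all.log" | grep beta)

printf 'grep saw: %s\n' "$matches"
printf 'the log contains:\n'
cat "$workdir/all.log"
```

grep only reports the matching line, while the log still holds all three.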

Combining the two

What prompted this post was working on a script to summarize the output from the qstat command on one of our clusters, which basically just reports information about the running jobs. By default it returns information about all users, which you can filter down to only your username with the -u flag. But since I'm pattern matching on the output already, and I'm in the middle of reading The AWK Programming Language, I went ahead and searched for my username with AWK instead.

qstat | awk 'BEGIN {q = r = e = 0} FNR == 1 {print} \
    (/USER/ && / [QRE] /) {if (/ Q /) q++; if (/ R /) r++; \
    if (/ E /) e++; print} END \
    {printf "%d in queue\n%d running\n%d exiting\n", q, r, e}' \
    | tee >( head -n -3 ) >( tail -n 3 ) > /dev/null

The AWK part of this is not that interesting, or is at least a topic for a separate post, so let us focus on the tee part.

tee >( head -n -3 ) >( tail -n 3 ) > /dev/null

As we saw above, the <( ) form of process substitution lets a command read another command's output as if it were a file. The >( ) form works in the other direction: it stands in for a file that a command writes to, and whatever is written there becomes the standard input of the process inside the parentheses. That is what is happening here. tee writes what was piped into it to two "files" that are really the commands head -n -3 and tail -n 3, which print all but the last three lines and only the last three lines of the input, respectively.
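To see both forms in isolation, here is a small sketch. The five input lines are made-up stand-ins for the AWK output, and the negative line count given to head is, as far as I know, a GNU extension, so this assumes GNU head.

```shell
#!/usr/bin/env bash
# Both forms expand to a file name (often something like /dev/fd/63 on
# Linux) that the outer command opens like any other file.
paths=$(echo <( true ) >( true ))
printf 'expanded to: %s\n' "$paths"

# Two fake job lines followed by three fake summary lines.
# tee copies all five lines into each process substitution; head keeps
# everything except the last three, tail keeps only the last three.
combined=$(printf '%s\n' job1 job2 queued running exiting \
    | tee >( head -n -3 ) >( tail -n 3 ) > /dev/null)
printf '%s\n' "$combined"
```

One caveat: head and tail run concurrently here, so nothing in this construction guarantees which of them prints first. Every line arrives exactly once, but the relative order of the two groups may vary from run to run.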

To make this a little clearer, the output of the AWK command on its own looks something like

Job id Name User Time Use S Queue
12345 jobName brent 00:00 R Q1
0 in queue
0 running
0 exiting

because counting up how many jobs are in queue, running, and exiting cannot finish until all of the input lines have been read, and by then there is no way to write the totals to the top of the output with AWK*. My earlier version of this script embarrassingly ran multiple qstat commands and gathered the output into variables that were then printed at the end. While this version is not formatted very well and looks kind of rough, it runs three times faster than that one, while also providing more information.

The output after the tee line is

0 in queue
0 running
0 exiting
Job id Name User Time Use S Queue
12345 jobName brent 00:00 R Q1

which may look like a small change, but is actually exactly what I was looking for. I often have many hundreds or thousands of jobs in the queue and only want to see the oldest ones, which are the ones actually running and which end up at the top. Now when I pipe the output of the script into head I can see the summary, as well as the jobs most likely to finish soon.

The last part of the tee command is probably embarrassing for me. My understanding of tee made me think I could keep just one of the head or tail commands as a process substitution and pipe into the other like normal:

| tee >( head -n -3 ) | tail -n 3

Unfortunately, I have not been able to get the same output this way, and that also leaves me with some weird behavior that I managed to avoid by directing the "rest" to /dev/null. So if you know what I am doing wrong, feel free to shoot me an email!
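One likely explanation, offered as a sketch rather than a definitive diagnosis: inside a pipeline, the standard output of tee is the pipe leading to tail, and in bash the command inside >( ) inherits that same standard output. So the lines printed by head also flow into tail instead of reaching the terminal. In the version that ends in > /dev/null, the substitutions appear to be set up before that redirection takes effect, so head and tail still write to the terminal. Assuming that is what is going on, sending head's output somewhere else, such as stderr, keeps the two streams apart:

```shell
#!/usr/bin/env bash
# The numbers 1 to 5 are stand-ins for the real AWK output.
# head's output is redirected to stderr so that it does not end up in
# the pipe that feeds tail; tail then sees only tee's own copy.
last_three=$(printf '%s\n' 1 2 3 4 5 | tee >( head -n -3 >&2 ) | tail -n 3)

# Only the tail part is captured here; head's lines went to stderr.
printf '%s\n' "$last_three"
```

The price is that the "all but the last three" part now arrives on stderr, which may or may not be acceptable depending on what consumes the script's output.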

Footnotes

*As I now know, you could save the output to an array in AWK and loop through the array to print the lines at the end.
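A minimal sketch of that array approach, using a made-up four-column stand-in for the qstat output (the real column layout differs, and the username brent is passed in through a variable I have named user for this example):

```shell
# Hypothetical qstat-like input: a header line followed by three jobs.
summary_first=$(printf '%s\n' \
    'Job id Name User S' \
    '1 jobA brent R' \
    '2 jobB brent Q' \
    '3 jobC other R' |
awk -v user=brent '
    NR == 1 { header = $0; next }        # remember the header for later
    $3 == user {
        if ($4 == "Q") q++               # count jobs by state
        else if ($4 == "R") r++
        else if ($4 == "E") e++
        lines[++n] = $0                  # buffer the line instead of printing
    }
    END {
        # The totals are known now, so they can be printed first.
        printf "%d in queue\n%d running\n%d exiting\n", q, r, e
        print header
        for (i = 1; i <= n; i++) print lines[i]
    }')
printf '%s\n' "$summary_first"
```

This keeps everything in a single AWK process and makes the output order deterministic, at the cost of holding all matching lines in memory until the end.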