bash lore: how to properly parse NUL-separated fields

Like a lot of other parts of bash, this is black magic.

Let's suppose we have a friendly command that spits out NUL-separated fields (like find -print0, shyaml get-values-0, ...). This, may I insist, is the recommended way to pass wild binary data around robustly in bash.

How would you properly parse these records, taking the fields in packets (say, two at a time)?

For the purpose of demonstration, let's use the fixed content of a simple data.bin file as our NUL-separated input:

cat <<EOF | tr : "\000" > /tmp/data.bin
a:1:b:2 3:c:4
  5:d:6\n7:e::f:9
EOF

Let's verify that we have our NUL bytes:

$ cat /tmp/data.bin | hexdump -v -e '/1 "%02X "'
61 00 31 00 62 00 32 20 33 00 63 00 34 0A 20 20 35 00 64 00 36 5C 6E 37 00 65 00 00 66 00 39 0A

You may have noticed that the data contains:

  • spaces (hex: 20),
  • line breaks (hex: 0A),
  • a \ followed by an n (hex: 5C 6E),
  • a zero-length value (two consecutive 00),
  • a final value 9 ending with a 0A and no final 00.

NUL-separated fields are recommended precisely because they can carry this kind of data.
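
The same bytes can also be produced without the heredoc and tr. This is a minimal sketch, assuming bash's printf builtin; make_data is a hypothetical helper name, not something used elsewhere in this post:

```shell
# make_data (hypothetical name) emits the same bytes as /tmp/data.bin above.
make_data() {
    # printf reuses the format for every argument, so each one becomes a
    # NUL-terminated field. %s copies its argument verbatim: the backslash
    # in '6\n7' stays a backslash.
    printf '%s\0' a 1 b '2 3' c $'4\n  5' d '6\n7' e '' f
    # The last field has no trailing NUL, only a line break.
    printf '9\n'
}

# Dump as hex bytes to compare with the hexdump output above.
make_data | od -An -v -tx1
```

Building the data from one argument per field avoids having to pick a stand-in character (here ":") that must never appear in the values.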

I want the implementation of a function read-0 that would allow this type of interaction:

$ cat /tmp/data.bin | while read-0 f1 f2; do
    echo "f1: '$f1', f2: '$f2'"
  done
f1: 'a', f2: '1'
f1: 'b', f2: '2 3'
f1: 'c', f2: '4
  5'
f1: 'd', f2: '6\n7'
f1: 'e', f2: ''
f1: 'f', f2: '9
'

First try

Let's be naive and simply use read f1 f2:

$ cat /tmp/data.bin | while read f1 f2; do echo "f1: '$f1', f2: '$f2'"; done
f1: 'a1b2', f2: '3c4'
f1: '5d6n7ef9', f2: ''

Notice that:

  • NUL chars were ignored for field separation.
  • Fields were separated on runs of whitespace, as dictated by the value stored in the IFS variable.
  • There are only 2 lines because the line break was used to separate records. We should use -d to specify the record delimiter.
  • The NUL chars were also stripped from the data, as bash variables can't hold the NUL char.
  • The \ was eaten, because the read builtin parses it and gives it a special meaning. We should use -r to avoid that.

But how should we provide the NUL delimiter to the read builtin, knowing that you can't put NUL chars on the command line? Fortunately, I stumbled onto this blog post: http://transnum.blogspot.sg/2008/11/bashs-read-built-in-supports-0-as.html

The conclusion is that bash's read builtin magically understands -d '' as "end each record at a NUL character".
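
Both options can be tried in isolation. The sketch below is only an illustration: the variable names and sample strings are made up, and read is fed from a here-string and a process substitution so that the variables stay in the current shell.

```shell
# -r: backslashes are kept as data instead of being treated as escapes.
read    cooked <<< '6\n7'
read -r raw    <<< '6\n7'
printf 'without -r: %s\n' "$cooked"
printf 'with -r:    %s\n' "$raw"

# -d '': the record ends at the first NUL byte, not at the first line break.
# IFS= keeps leading and trailing whitespace untouched.
IFS= read -r -d '' record < <(printf ' one\ntwo \0three\0')
printf 'record: [%s]\n' "$record"
```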

Better try

Let's apply our newly acquired knowledge by trying IFS=$'\0' read -d '' -r f1 f2:

$ cat /tmp/data.bin | while IFS=$'\0' read -d '' -r f1 f2; do echo "f1: '$f1', f2: '$f2'"; done
f1: 'a', f2: ''
f1: '1', f2: ''
f1: 'b', f2: ''
f1: '2 3', f2: ''
f1: 'c', f2: ''
f1: '4
  5', f2: ''
f1: 'd', f2: ''
f1: '6\n7', f2: ''
f1: 'e', f2: ''
f1: '', f2: ''
f1: 'f', f2: ''

That's much better. But notice that:

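  • We didn't get anything in $f2, and that's normal: with NUL as the record delimiter (-d ''), every record holds exactly one field, so there is nothing left to split. Note also that bash can't store a NUL in a string, so IFS=$'\0' really sets IFS to the empty string, which simply disables field splitting and whitespace trimming. We will need to manage the repacking ourselves in a while loop. This doesn't sound too difficult.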
  • Where's the final field, 9 followed by 0A? As there is no final NUL character in the data, read returned exit status 1 on this last field, but it did fill the variable correctly. A simple echo $f1 after the loop prints 9, provided you use this form, which runs the loop in the current shell so that its variables remain accessible: while IFS='' read -d '' -r f1 f2; do echo "f1: '$f1', f2: '$f2'"; done < /tmp/data.bin
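
These points can be checked separately. The sketch below uses made-up sample data. It shows that read fills the variable even when it reports failure at EOF, then a commonly used || [ -n "$f" ] guard that keeps such a last field (at the cost of ignoring a final empty one), and finally the difference between piping into the loop and redirecting into it.

```shell
# No trailing NUL: read reports failure but still stores what it got.
if IFS= read -r -d '' last < <(printf '9\n'); then ok=yes; else ok=no; fi
printf 'read succeeded: %s, value: %q\n' "$ok" "$last"

# Guard: also run the body when read failed but left data in the variable.
fields=()
while IFS= read -r -d '' f || [ -n "$f" ]; do
    fields+=("$f")
done < <(printf 'x\0y\0z')
printf '%s fields kept\n' "${#fields[@]}"

# A pipe runs the loop in a subshell (by default), so its variables are lost;
# a redirection keeps the loop in the current shell.
n=0
printf 'x\0y\0' | while IFS= read -r -d '' f; do n=$((n + 1)); done
printf 'after pipe: n=%s\n' "$n"
while IFS= read -r -d '' f; do n=$((n + 1)); done < <(printf 'x\0y\0')
printf 'after redirection: n=%s\n' "$n"
```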

Final Implementation?

So knowing this, here is the final implementation of read-0:

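read-0() {
    local eof
    eof=
    # Fill one variable per NUL-terminated field, until we run out of
    # variable names or hit EOF.
    while [ -n "$1" ] && [ -z "$eof" ]; do
        IFS='' read -r -d '' -- "$1" || eof=true
        shift
    done
    # Fail only if EOF was hit while some variables were still unfilled.
    [ "$eof" != true ] || [ -z "$1" ]
}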
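
As a usage sketch, here is read-0 filling an associative array from key/value pairs. The function body is the same logic as above, repeated so the snippet runs on its own. The conf array, its keys and the sample data are made up for the example, and declare -A assumes bash 4 or later.

```shell
# Same read-0 as above, repeated to keep the snippet self-contained.
read-0() {
    local eof
    eof=
    while [ "$1" -a -z "$eof" ]; do
        IFS='' read -r -d '' "$1" || eof=true
        shift
    done
    test "$eof" != true -o -z "$1"
}

declare -A conf
# Redirect from a process substitution rather than piping, so that conf
# is filled in the current shell and survives the loop.
while read-0 key value; do
    conf[$key]=$value
done < <(printf '%s\0' name 'John Doe' motd $'hello\nworld')

echo "${#conf[@]} keys, name=${conf[name]}"
```

Note that the variable names passed to read-0 must not be eof, since that name is local to the function.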

Final? It surely works for our specification test. But what happens if EOF occurs before we have filled all the variables?

$ echo -n "a" | while read-0 f1 f2; do echo "f1: '$f1', f2: '$f2'"; done
$

Nothing is spit out, despite the fact that we have sent a character.

This is now a specification issue: do we want read-0 to return exit status 0 when it hits EOF while filling the variables? Okay, but we said a zero-length string was a possible value... so if read-0 returned 0 here, the caller could not tell that EOF was hit while filling the variables, as the first variable can legitimately be the zero-length string. We could make a special case, but I want to be able to distinguish a last empty element from no element at all.

read-0, under the current specification, can't do that. But a slight change in the way you build your while loop makes that parsing possible.

Correct Implementation

To fill partial records, we will need another specification change, as the current implementation fails whenever it encounters EOF before the last variable. This is an incompatible change to the specification. Aside from this, we also need to take care to actually set the remaining fields to the empty string. This requires another version of read-0:

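read-0() {
    local eof
    eof=
    # Keep going after EOF so that every remaining variable is set to ''.
    while [ -n "$1" ]; do
        IFS='' read -r -d '' -- "$1" || eof=true
        shift
    done
    # Report EOF to the caller, even when the variables hold a last record.
    [ "$eof" != true ]
}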

So this works with the new read-0:

$ echo -n a | tr : '\000' | { eof= ; while [ -z "$eof" ]; do read-0 f1 f2 || eof=true; echo "f1: '$f1', f2: '$f2'"; done }
f1: 'a', f2: ''

$ echo -n a: | tr : '\000' | { eof= ; while [ -z "$eof" ]; do read-0 f1 f2 || eof=true; echo "f1: '$f1', f2: '$f2'"; done }
f1: 'a', f2: ''

Basically, this construct allows one last round in the loop after detecting EOF, and achieves the original spec:

$ cat /tmp/data.bin | { eof= ; while [ -z "$eof" ]; do read-0 f1 f2 || eof=true; echo "f1: '$f1', f2: '$f2'"; done }
f1: 'a', f2: '1'
f1: 'b', f2: '2 3'
f1: 'c', f2: '4
  5'
f1: 'd', f2: '6\n7'
f1: 'e', f2: ''
f1: 'f', f2: '9
'
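
One case the test data does not exercise is input that ends with a NUL, which is what find -print0 produces. With the loop above, EOF is then only detected on the following call, so one extra round runs with every variable empty. The sketch below shows that, then one possible way around it. read-0x is a hypothetical name, not part of the original code: it fails only when EOF is hit before any byte of the first field, so a plain while loop works again. The trade-off is that a final, unterminated empty field counts as no field at all.

```shell
# read-0 from the previous section, repeated so the snippet runs on its own.
read-0() {
    local eof
    eof=
    while [ "$1" ]; do
        IFS='' read -r -d '' -- "$1" || eof=true
        shift
    done
    test "$eof" != true
}

# Input ending with a NUL: the second round prints an empty pair.
printf '%s\0' a 1 | {
    eof=
    while [ -z "$eof" ]; do
        read-0 f1 f2 || eof=true
        echo "f1: '$f1', f2: '$f2'"
    done
}

# Hypothetical variant: succeeds as long as the record has any content.
# The variable names passed in must not be "ret", which is local here.
read-0x() {
    local ret=0
    # First field: failing with an empty value means nothing was left to read.
    IFS= read -r -d '' -- "$1" || [ -n "${!1}" ] || ret=1
    shift
    # Remaining fields are always assigned, so missing ones end up empty.
    while [ -n "$1" ]; do
        IFS= read -r -d '' -- "$1" || true
        shift
    done
    return "$ret"
}

printf '%s\0' a 1 | while read-0x f1 f2; do echo "f1: '$f1', f2: '$f2'"; done
printf 'b'        | while read-0x f1 f2; do echo "f1: '$f1', f2: '$f2'"; done
```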

Trivial?

Happy hacking.