I was trying to convert bank-issued CSV files that have a weird data layout, for automatic processing by the excellent ledger. This made me learn a new thing about awk. Follow me down the UNIXy goodness!
The data I was getting looks like this:
Transaction Date,Transaction Description,Debit Amount,Credit Amount,Balance
29/02/2018,Restaurant,14.5,,1234.2
12/03/2018,ATM withdrawal,50,,1184.2
23/04/2018,Deposited check,,100,1284.2
Note the last line, which has 100 in a different column: when receiving money, the amount is logged under “Credit Amount” instead of the (more common) “Debit Amount” column used for debits. My importing tools required a single column, so the output must look like this:
Transaction Date,Transaction Description,Debit Amount,Credit Amount,Balance
29/02/2018,Restaurant,14.5,,1234.2
12/03/2018,ATM withdrawal,50,,1184.2
23/04/2018,Deposited check,-100,,1284.2
The change would be straightforward with sed if it were a simple word-substitution problem, but here it involved parsing the fields and acting on selected transactions only. Naturally I turned to awk, which gave me the following conversion:
awk -F',' 'NR > 1 && $4 {print $1 "," $2 ",-" $4 ",," $5}' data.csv
Resulting in
23/04/2018,Deposited check,-100,,1284.2
Decomposing the command: we ask awk to split fields on commas (-F','), skip the header line (NR > 1), and apply changes on lines that have a non-empty fourth field (“Credit Amount”); the change is simply to print the same values, shifted one column to the left with a - in front. The NR > 1 guard matters: the header’s own fourth field, the text “Credit Amount”, is non-empty, so without it the header would get mangled too.
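As an aside, the same rule can be written with printf instead of string concatenation, which some find easier to read (a minimal equivalent sketch, not from the original command):

```shell
# printf-style equivalent: one format string, the fields as arguments.
# NR > 1 skips the header; $4 selects lines with a credit amount.
awk -F',' 'NR > 1 && $4 { printf "%s,%s,-%s,,%s\n", $1, $2, $4, $5 }' data.csv
```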
This worked great to convert those entries, but it would have meant separating the “positive data” and the “negative data” and re-joining them manually later, possibly losing the original ordering. In effect, I wanted awk to do nothing on debits and to swap credit operations with the script above. I had to look up how to do that, and re-discovered the -e flag of awk in the process!
-e program-text
--source program-text
Use program-text as AWK program source code. This option allows the
easy intermixing of library functions (used via the -f and -i
options) with source code entered on the command line. It is
intended primarily for medium to large AWK programs used in shell
scripts.
So we can use multiple programs on the same data, ending up with quite a clean solution:
awk -F',' \
    -e 'NR > 1 && $4 {print $1 "," $2 ",-" $4 ",," $5}' \
    -e 'NR == 1 || !$4 {print}' \
    data.csv
Resulting as expected in:
Transaction Date,Transaction Description,Debit Amount,Credit Amount,Balance
29/02/2018,Restaurant,14.5,,1234.2
12/03/2018,ATM withdrawal,50,,1184.2
23/04/2018,Deposited check,-100,,1284.2
The script shows quite clearly that when a line has a fourth field (a credit), a transformation is applied, and when it doesn’t, the line is printed as it was. awk processes records line by line, so both “programs” are run on each line (with only one rule matching each time), preserving the order just as I wanted.
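The same dispatch could also be folded into a single program with an explicit if/else — equivalent behavior, just packed into one action block (a sketch for comparison, not from the original post):

```shell
# Single-program equivalent of the two -e rules above.
awk -F',' '{
    if (NR > 1 && $4)
        print $1 "," $2 ",-" $4 ",," $5   # credit: shift column, negate
    else
        print                             # header or debit: pass through
}' data.csv
```

The two -e rules read better to me, though: each rule states its own condition, instead of one branch being defined as the negation of the other.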
Note that the -e flag is a common feature of such scripting languages: it exists in sed as well, which is what made me check the manual again when thinking about this.
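For comparison, sed accepts multiple -e expressions in much the same way; a trivial illustration unrelated to the CSV problem:

```shell
# Both expressions are joined into one script and applied, in order,
# to every input line — just like multiple awk -e programs.
printf 'hello world\n' | sed -e 's/hello/goodbye/' -e 's/world/moon/'
# prints: goodbye moon
```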
Recapping: sed is good at word substitution (and more, but that’s not the point) and awk is fantastic at processing delimited records, applying operations on each entry. Hurray for standard UNIX tools!