Today I needed to keep an eye on PostgreSQL logs. Luckily, I decided upon installation to log everything using the “csvlog” format. But there’s a small catch, depending how you read that log. This catch is newline characters in database queries.
This has nothing to do with PostgreSQL directly. In fact, it does the right thing, in that it quotes all required fields. Now, a quoted field can contain a newline character. But if you read the file on a line-by-line basis (using methods like file_handle.readline, this will case problems. No matter what programming language you use, if you call readline, it will read up to the next newline character and return that. So, let’s say you have the following CSV record:
If you read this naïvely with “readline” calls, you will get the following:
6: FROM port p1
7: INNER JOIN port p2 USING (link)
8: WHERE p1.device='E'
9: AND p1.scope='provisioned'
10: AND p1.label='Eg'
11: AND (p1.device = p2.device
12: AND p1.scope = p2.scope
13: AND p1.label=p2.label) = false",,,,,,,,,""
Now, this is really annoying if you want to parse the file properly.
Read the file byte-by-byte, and feed a line to the CSV parser only if you hit a newline outside of quoted text. Obviously you should consider the newline style (\n, \r or \r\n) and the proper quote and escape characters when doing this.
What about Python?
It turns out, Python’s csv module suffers from this problem. The builtin CSV module reads files line-by-line. However, it is possible to override the default behavior.
For my own purpose, I wrote a simple script, reading from the postgres log until interrupted.
You are free to use this for your own purpose, modify or extend it as you like.
You can find it here: exhuma/postgresql-logmon