Category Archives: computing

Google’s TCP Fast Open Broken with Exim…

June 1, 2022 Graeme Coates 2 Comments

The Issue…

After performing an update to a server I maintain (moving from Debian 10 (Buster) to Debian 11 (Bullseye)), the installed version of Exim moved from 4.92.x to 4.94.2. Along with the noted changes needed to ensure that the config didn’t fall foul of the restrictions around taints, I started to notice some odd behaviour, but only when delivering large mail (e.g. with attachments – a hundred kB or so was enough) to Google MX hosts.

Google mail delivery has never previously been an issue – the usual suspects have been Yahoo and Hotmail. However, this time I was seeing mail queue entries that hung around for hours and hours for no apparent reason: the mail address was absolutely valid, I could deliver other mail to the target, and indeed mail with attachments through other outbound SMTP servers. Eventually, delivery would work, but after a delay (sometimes substantial).

The only clue I had in mainlog were entries like:

/var/log/exim4/mainlog:2022-04-27 07:47:30 1njbGQ-005LxL-M5
H=gmail-smtp-in.l.google.com [2a00:1450:4010:c0e::1a]: SMTP timeout after
sending data block (199774 bytes written): Connection timed out

/var/log/exim4/mainlog:2022-04-27 07:50:10 1njbGU-005Lz8-RV
H=gmail-smtp-in.l.google.com [74.125.131.26]: SMTP timeout after end of data
(246239 bytes written): Connection timed out
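To see how widespread entries like these are, the log can be summarised by host and by the stage at which the timeout occurred. The following is a minimal sketch, assuming each log entry sits on a single line in the real mainlog (the excerpts above are wrapped for display); the function name summarise_timeouts and the regular expression are illustrative and not part of Exim.

```python
import re
from collections import Counter

# Matches the "SMTP timeout after ..." entries shown above.
# Assumption: one log entry per line, as in an unwrapped mainlog.
TIMEOUT_RE = re.compile(
    r"H=(?P<host>\S+) \[(?P<ip>[^\]]+)\]:? "
    r"SMTP timeout after (?P<stage>.+?) "
    r"\((?P<bytes>\d+) bytes written\)"
)


def summarise_timeouts(lines):
    """Count SMTP timeouts per (host, stage) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("stage"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python3 timeouts.py < /var/log/exim4/mainlog
    for (host, stage), n in summarise_timeouts(sys.stdin).most_common():
        print(f"{n:6d}  {host}  ({stage})")
```

If the timeouts cluster on one provider's MX hosts, as they did here, that points away from a general network fault.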

It didn’t appear ipv4/v6 made a difference (see above), and running exim in debug mode wasn’t particularly informative:

1315427 tls_write(0x55941d1d57b0, 8191)
1315427 gnutls_record_send(session=0x55941d2a3cc0, buffer=0x55941d1d57b0, left=8191)
1315427 outbytes=8191
1315427 flushing headers buffer
1315427 writing data block fd=6 size=8191 timeout=300
1315427 tls_write(0x55941d1d57b0, 8191)
1315427 gnutls_record_send(session=0x55941d2a3cc0, buffer=0x55941d1d57b0, left=8191)
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[_gnutls_io_write_flush]:727
1315427 GnuTLS<3>: ASSERT: ../../lib/record.c[_gnutls_send_tlen_int]:582
1315427 GnuTLS<2>: WRITE: -1 returned from 0x6, errno: 104
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[_gnutls_io_write_flush]:722
1315427 GnuTLS<3>: ASSERT: ../../lib/record.c[_gnutls_send_tlen_int]:589
1315427 outbytes=-110
1315427 tls_write: gnutls_record_send err
1315427 LOG: MAIN
1315427   H=aspmx.l.google.com [2a00:1450:400c:c07::1a] TLS error on connection (send): The TLS connection was non-properly terminated.
1315427 ok=0 send_quit=0 send_rset=1 continue_more=0 yield=1 first_address is NULL
1315427 tls_close(): shutting down TLS
1315427 GnuTLS<2>: WRITE: -1 returned from 0x6, errno: 32
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[errno_to_gerr]:230
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[_gnutls_io_write_flush]:722
1315427 GnuTLS<3>: ASSERT: ../../lib/record.c[gnutls_bye]:294
1315427   SMTP(close)>>
1315427 LOG: MAIN
1315427   H=aspmx.l.google.com [2a00:1450:400c:c07::1a]: SMTP timeout after sending data block (289658 bytes written): Connection timed out
1315427 set_process_info: 1315427 delivering 1njm9Z-005WCX-HC: just tried aspmx.l.google.com [2a00:1450:400c:c07::1a] for myusername@mydomain.com: result DEFER
1315427 added retry item for T:aspmx.l.google.com:2a00:1450:400c:c07::1a: errno=110 more_errno=0,M flags=2

A packet trace gave a slightly more interesting, but not particularly informative set of results, which indicate that there are multiple retries of a TCP segment for which the ACK has already been received from the Google server.

If I set:

sysctl net.ipv4.tcp_window_scaling=0

(i.e. turning off TCP window scaling in the kernel) then things worked fine, but this has the non-trivial side-effect of badly affecting all network performance. Setting TCP window scaling back to “on” landed me with the issue again. At least it was reproducible…

The Workaround…

A post by Adam Barratt suggested that this was to do with TCP Fast Open (and specifically with Google’s implementation), though the referenced blog post seemed to indicate a different set of symptoms.

Still, I gave this a whirl by adding:

hosts_try_fastopen = !*.l.google.com

into /etc/exim4/conf.d/transports/30_exim4-config_remote_smtp . Usefully, at the time of the first test, I had a load of mailq entries for mail going to Google hosts which had previously failed delivery as above. On restart of the exim4 processes, everything cleared itself on first queue run – and I’ve not seen any issues in the logs since (have been running this for 3 weeks+ now).
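For anyone unsure which hosts a list like this covers, here is a rough Python approximation of how a negated wildcard entry is evaluated. It is only a sketch: it handles plain colon-separated wildcard names, the function tries_fastopen is hypothetical, and real Exim host lists support far more (IP addresses, lookups, and so on). It relies on the Exim rule that running off the end of a list whose last item was negated counts as a match.

```python
from fnmatch import fnmatchcase


def tries_fastopen(host, host_list="!*.l.google.com"):
    """Approximate whether Exim would attempt TCP Fast Open for `host`.

    Assumption: the list holds only colon-separated wildcard host names,
    each optionally prefixed with "!" to negate it.
    """
    items = [item.strip() for item in host_list.split(":") if item.strip()]
    last_negated = False
    for item in items:
        last_negated = item.startswith("!")
        pattern = item[1:].strip() if last_negated else item
        if fnmatchcase(host.lower(), pattern.lower()):
            # First matching item decides; a negated item means "not in the list".
            return not last_negated
    # No item matched: the result is a match only if the last item was negated.
    return last_negated


if __name__ == "__main__":
    for name in ("gmail-smtp-in.l.google.com", "aspmx.l.google.com", "mx.example.org"):
        print(name, tries_fastopen(name))
```

With the setting above, Google's MX hosts are excluded and every other host still gets a Fast Open attempt.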

Quite why disabling TCP Fast Open for these hosts works, I’m not sure – it looks like it might be a combination of factors in the Debian release according to Adam’s post as above, but it’s not really an Exim bug at all as it does everything correctly: chalk one up to Google doing something very odd. EDIT: This is a kernel bug as reported at: https://lore.kernel.org/lkml/E1nZMdl-0006nG-0J@plastiekpoot/ and is fixed in the 5.18 kernel, which (as of 10-Aug-2022) is available in Debian Testing. In the interim, at least until either users are brave enough to switch to a Testing kernel, or the kernel gets released into Stable, the above workaround seems appropriate.

Either way, not an obvious solution to a very weird problem where everything “just worked” prior to upgrade, but hopefully this might help some other exim admins who are scratching their heads as well…

computing, sysadmin

Photon: Where is my Bandwidth Going?

November 7, 2015 Graeme Coates Leave a comment

I’ve recently been investigating some curious behaviour that has been going on on a server I administer. This is a VM hosted with Mythic Beasts, and it hosts a variety of sites, several of which use WordPress (like this!) – the specific problem I’ve seen is a large bandwidth increase over a couple of months. The following chart demonstrates:

Now, the usage levels are still well inside my allowed totals, but I decided to have a look – I know a couple of the sites are reasonably popular ones, and they could have had a surge in activity, but the figures are more interesting.

As it turned out, after totting up totals by vhost, the vast majority of the bandwidth was going on one site. In a 2 week period, 95% of the bandwidth was serving one vhost at about 6GB a day. More curiously, this vhost (running a fully patched WP instance) had of the order of 500 hits a day over this period. While there are some wonderful images on this site, and several per post, there should be nowhere near this amount of data. What makes it more confusing is that Jetpack was installed, and had Photon enabled.

Photon is the free service from WordPress that allows serving of images from a content delivery network (CDN), which is designed to take load (really bandwidth as images are simple to serve) away from the source server. Having done a little digging in the logs, it turns out there were entries like this:

192.0.86.129 - - [01/Nov/2015:03:40:11 +0000] 'GET /wp-content/uploads/2014/11/IMG_1667.jpg HTTP/1.1' 200 500600 '-' 'Photon/1.0'
192.0.86.185 - - [01/Nov/2015:03:40:17 +0000] 'GET /wp-content/uploads/2014/11/IMG_1667.jpg HTTP/1.1' 200 500600 '-' 'Photon/1.0'
192.0.86.162 - - [01/Nov/2015:03:40:20 +0000] 'GET /wp-content/uploads/2014/11/IMG_1667.jpg HTTP/1.1' 200 500600 '-' 'Photon/1.0'

192.0.86.143 - - [01/Nov/2015:03:47:35 +0000] 'GET /wp-content/uploads/2014/11/IMG_1667.jpg HTTP/1.1' 200 500600 '-' 'Photon/1.0'
192.0.86.125 - - [01/Nov/2015:03:47:38 +0000] 'GET /wp-content/uploads/2014/11/IMG_1667.jpg HTTP/1.1' 200 500600 '-' 'Photon/1.0'

So, 145 requests for the same file within 7 minutes – what on earth? CDNs are supposed to cache to avoid re-requesting data like this. More confusing is that there just weren’t this number of page requests in this period – in fact there were only 18 requests for feeds or posts… Across 2 weeks in this single vhost, total traffic was 82GB, with 73GB of images being served to Photon alone – and there’s only 2GB of data on disc for the vhost, including code!
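Totals like these can be pulled out of an access log with a few lines of Python. This is a minimal sketch, assuming the log format is exactly as in the excerpt above (single-quoted request, referrer and user agent fields); the function name photon_usage is illustrative.

```python
import re
from collections import defaultdict

# Assumption: fields are laid out as in the access log excerpt above.
LOG_RE = re.compile(
    r"""^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] """
    r"""'(?P<method>\S+) (?P<path>\S+) [^']*' """
    r"""(?P<status>\d{3}) (?P<size>\d+|-) '[^']*' '(?P<agent>[^']*)'$"""
)


def photon_usage(lines):
    """Return {path: (request_count, total_bytes)} for Photon requests only."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_RE.match(line.strip())
        if not match or not match.group("agent").startswith("Photon"):
            continue
        size = match.group("size")
        totals[match.group("path")][0] += 1
        totals[match.group("path")][1] += 0 if size == "-" else int(size)
    return {path: (count, size) for path, (count, size) in totals.items()}


if __name__ == "__main__":
    import sys

    # Usage: zcat -f site-access.log* | python3 photon_usage.py
    usage = photon_usage(sys.stdin)
    for path, (count, size) in sorted(usage.items(), key=lambda kv: -kv[1][1]):
        print(f"{size / 1e6:10.1f} MB  {count:6d}  {path}")
```

At 500600 bytes per response, 145 requests for this one image come to roughly 72.6 MB in those 7 minutes.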

So, after raising a support call with Jetpack (and I have to say the support staff were helpful and responsive here), I got a little way to working out what it was doing.

When Photon gets a request for a file it doesn’t have, it requests the full size image from the source blog, which is fine. Photon also does resizing/optimization of images when the image requested has relevant GET parameters (eg w, h, resize, etc) – here it does the same thing in that it re-requests the full size image and resizes it.

Though the devs confirmed that this is the expected behaviour, I think this is a design flaw. Photon already stores the images forever (if you want to change an image, you must change the filename – that’s documented) and it would make sense to retain the original and use this for resizing appropriately when it needs to. Even as is, you’d expect the images in use to get cached and then be served from the CDN, so you should reasonably quickly build up all the sizes you need in cache.

So what about the 145 requests? Well, after being sent the logs by Jetpack support, it’s clear that there are a load of individual requests for the same image with different sizes, eg:

i2.wp.com - - - - [01/Nov/2015:03:40:39 +0000] GET /XXX/wp-content/uploads/2014/11/IMG_1667.jpg?zoom=2&resize=523%2C512 HTTP/1.1 <REFERER>
i2.wp.com - - - - [01/Nov/2015:03:40:42 +0000] GET /XXX/wp-content/uploads/2014/11/IMG_1667.jpg?zoom=2&resize=519%2C508 HTTP/1.1 <REFERER>
i2.wp.com - - - - [01/Nov/2015:03:40:45 +0000] GET /XXX/wp-content/uploads/2014/11/IMG_1667.jpg?zoom=2&resize=515%2C504 HTTP/1.1 <REFERER>

And so on, all the way down to a resize request with size 61 x 50px… So, this is expected behaviour, but (despite the referrer looking right) these image requests don’t appear to be from pages that are WP generated. I’m not sure really how and where these are being generated – are they bots, apps, scrapes, feeds? Hard to know, as I don’t have the full details, but it’s hugely inefficient once Photon starts dealing with those requests. This is by no means the worst – from the last 2 weeks:

# zgrep "Photon" site-access.log.* | grep "/wp-content/uploads/2015/08/imagename.jpg" | wc -l

5484

Ouch.
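Since every distinct resize triggers a fresh fetch of the full-size original, it is useful to count how many size variants are being requested per image. The sketch below assumes request URLs in the form shown in the Jetpack log excerpt above (a resize parameter holding width and height separated by a URL-encoded comma); the function names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit


def resize_request(url):
    """Split a Photon request URL into (path, (width, height) or None)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # decodes %2C to "," for us
    size = None
    if "resize" in query:
        width, height = query["resize"][0].split(",")
        size = (int(width), int(height))
    return parts.path, size


def variants_per_image(urls):
    """Return {path: number of distinct resize dimensions requested}."""
    seen = {}
    for url in urls:
        path, size = resize_request(url)
        if size is not None:
            seen.setdefault(path, set()).add(size)
    return {path: len(sizes) for path, sizes in seen.items()}
```

An image with hundreds of distinct sizes a few pixels apart is a strong hint that the requests are not coming from normal theme-generated pages.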

So, my current approach is to stop serving images in this way for this site – I can’t block it entirely, since you need Photon to serve image grids, etc, but after a day it looks to be going in the right direction (see the flat, non peaky graph in the last day):

I’d be interested if anyone else has seen similar behaviour – please leave comments!

Disappointingly, Jetpack say there are no plans to change this behaviour, so for me this rules out its use on this size of site. It clearly does save bandwidth for normal requests, but it’s had a negative impact here – I could see it working for much larger image-heavy sites, even with the frequent re-requests, but it’s just not working here.

EDIT 18-Nov-2015:

Point proved? (There was a popular post today which accounts for the late blip!)

astronomy, computing, eclipses

Eclipse Calculations using Python

March 18, 2015 Graeme Coates 2 Comments

On 20th March there is a deep partial solar eclipse over the UK which is total over the Faroe Islands and Svalbard. I set about trying to determine the time of maximum eclipse, and percentage eclipsed at a given location.

I used the pyephem module which is a Python implementation of the numerical routines that are behind the excellent Xephem. The module will allow a user to calculate the positions of astronomical objects (eg Sun, Moon, etc) from a given location at a given time. Using the positions of these objects it’s possible to determine the separation of the centres of these objects.

To work out the percentage eclipse at that time, I used the formula on the Wolfram Mathworld page on “Lunes” (author: Weisstein, Eric W). A “Lune” is the “plane figure bounded by two circular arcs of unequal radii, i.e., a crescent.” which exactly describes the visible surface of the sun during an eclipse.

With the area of the lune calculated, it’s trivial to work out the percent eclipse at that time – and by looping over time, it’s possible to get a list of lists to search for a maximum.
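As an illustration of the geometry (this is not the contents of eclipse_percent.py), here is a minimal sketch that works out the eclipsed fraction from two angular radii and the separation of the centres. It uses the standard circle-circle intersection area rather than the MathWorld lune formula; the two are equivalent here, since the lune is the Sun's disc minus the overlap. The function names are made up for this example, and the inputs are plain numbers in any consistent angular unit, where the real script would take them from pyephem.

```python
import math


def overlap_area(r_sun, r_moon, d):
    """Area common to two discs of radius r_sun and r_moon, centres d apart."""
    if d >= r_sun + r_moon:
        return 0.0  # discs do not touch: no eclipse
    if d <= abs(r_sun - r_moon):
        return math.pi * min(r_sun, r_moon) ** 2  # smaller disc fully inside
    # Two circular sectors minus the kite formed by the centres and the
    # two intersection points.
    a_sun = r_sun**2 * math.acos((d**2 + r_sun**2 - r_moon**2) / (2 * d * r_sun))
    a_moon = r_moon**2 * math.acos((d**2 + r_moon**2 - r_sun**2) / (2 * d * r_moon))
    kite = 0.5 * math.sqrt(
        (-d + r_sun + r_moon) * (d + r_sun - r_moon)
        * (d - r_sun + r_moon) * (d + r_sun + r_moon)
    )
    return a_sun + a_moon - kite


def percent_eclipsed(r_sun, r_moon, d):
    """Percentage of the Sun's disc hidden by the Moon."""
    return 100.0 * overlap_area(r_sun, r_moon, d) / (math.pi * r_sun**2)
```

Evaluating percent_eclipsed at each time step and keeping the largest value gives the time and depth of maximum eclipse.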

The code can be downloaded using this link: eclipse_percent.py

The result I get (using the location of the Science Oxford setup on 20th March – the SBS in Oxford, UK) is as follows:

Max Eclipse occurs at: 2015-03-20 09:30:11 GMT
Max percentage eclipse: 85.86
First contact: 08:24:21 GMT
Last contact: 10:40:04 GMT

All we need now is some decent weather…

The featured image here is an H-Alpha filtered frame taken by me on 3rd Oct 2005 from Denia, Spain, during the early phases of an annular eclipse.

computing, sysadmin

Debian Wheezy, Apache+FCGI+PHP; changes to /etc/mime.types and php5-cgi

December 16, 2013 Graeme Coates 1 Comment

Here’s a potentially useful note for anyone upgrading to Debian Wheezy on a system that uses Apache2 + FCGI + PHP. If you’ve configured it to run using one of several guides (like these: https://community.x10hosting.com/threads/debian-apache-2-2-fastcgi-php-5-suexec-the-easy-way.148894/, http://davejamesmiller.com/blog/how-to-set-up-php-fastcgi-with-suexec-on-debian) then you might well be bitten by a similar issue to that reported in http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=687307.

Previously, one could define config such as the following in an apache2 conf.d file:

AddType application/x-httpd-php .php

AddHandler php-fcgi .php
Action php-fcgi /fcgi-bin/php5-fcgi

<Location /fcgi-bin/>
SetHandler fcgid-script
Options +ExecCGI
</Location>

This would instruct Apache to use the handler “php-fcgi” to process .php files – with the “Action” referencing a wrapper held at /fcgi-bin/php5-fcgi (suitably aliased in the vhost). This all looks well and good and doesn’t appear to change between squeeze and wheezy (Apache is still at 2.2).

However, if you do a straight upgrade, you may find that your server starts serving out php files in plaintext (not only is your site down, but it’s a security risk as well with potential connection details listed in config files). In Wheezy, the php MIME types have disappeared from /etc/mime.types – php5-cgi now includes two files (in /etc/apache2/mods-available) to try and correct the missing MIME type definitions. With php5-cgi enabled in the webserver, the config as follows is included:

<FilesMatch ".+\.ph(p[345]?|t|tml)$">
SetHandler application/x-httpd-php
</FilesMatch>

This sets the handler appropriately. With this set, Apache serves out the file as text, instead of using the relevant action “php-fcgi” – the FilesMatch directive overriding the old config. The fix is reasonably simple – comment out the AddType and AddHandler in the conf.d file and change the Action line so you have:

Action application/x-httpd-php /fcgi-bin/php5-fcgi

If you just want sitewide php5-cgi with no suexec, then you don’t even need the above – just uncomment the last section of php5-cgi.conf in mods-available, which has a similar “Action” directive to that above. I keep the above as I use suExec to run the fcgi processes under individual accounts (you’re unable to call outside of the suexec root, and it’s easy to repoint the fcgi-bin location appropriately in each virtualhost).

(Note that this type of config also appears not to be vulnerable to execution of files of the type evil.php.jpg, thanks to the FilesMatch directive in the module .conf)
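The reason is the trailing $ in the FilesMatch pattern: the PHP extension has to be the last thing in the filename. This can be checked quickly with Python's re module, as a sketch only; Apache uses PCRE, but for a pattern this simple the two should behave the same. The helper name handled_as_php is made up for this example.

```python
import re

# The pattern from the php5-cgi module config quoted above.
PHP_FILES = re.compile(r".+\.ph(p[345]?|t|tml)$")


def handled_as_php(filename):
    """True if the FilesMatch pattern would apply the PHP handler."""
    return PHP_FILES.search(filename) is not None


if __name__ == "__main__":
    for name in ("index.php", "page.phtml", "old.php3", "evil.php.jpg"):
        print(f"{name:15s} {handled_as_php(name)}")
```

A double-extension file such as evil.php.jpg ends in .jpg, so the pattern does not match and the PHP handler is never set for it.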

Featured image adapted from work by W. Rebel (Wikimedia Commons)

computing

Testing, Fixing and Costs

March 10, 2009 Graeme Coates 3 Comments

For those of you who don’t know me, you almost certainly won’t know what I do. Of course, there are probably a load of people who do know me, who still don’t know what I do (and, no, “Nothing” is not the answer). I work as a software tester and I have done for the last 7+ years now in a few different places.

Generally, this job involves a fair bit of evangelism – sometimes it’s quite successful (e.g. promoting the use of Bugzilla as a defect tracking tool). One of my favourite diagrams is that shown in this post – I like this graph a lot. It is a graph showing the rough relationship between the cost of fixing a bug or defect, and the stage of the development process at which that bug or defect was found.

It’s fairly clear from the graph that, the later you realise there is a problem, the more it costs you to go back and unravel what is wrong and sort it. The reasons are fairly clear – if you find a problem at a later stage, you often have to go right back to the beginning of the process of development, testing and so on.

Some notes I like to make relating to this:

1. Even if you are already employed by a company, you are not “free”. Having someone fix a problem and repeat work costs money – “we already pay their wages” is not an argument! Sheffield Teaching Hospitals Trust – take heed (Quote from the Reg article: The trust argued that the consequences of its decision making had not cost public money, “just time and effort by the IT teams”.).
2. Accurate, timely requirements are essential. Finding out that you have mis-specified something as it is nearing the release date is a Bad Thing™.
3. Not having requirements before coding is asking for even more trouble.
4. Changing requirements part way through the process (or, worse, finding out during testing that your requirements were duff!) is much along the lines of 2 and 3 with similar outcomes (moving goalposts anyone?).
5. Doing unit testing is much better than sending code straight to the testers – it saves a lot of heartache on both sides…
6. Actually having enough time to perform a sufficient level of testing can save you an enormous amount of hassle and cost.
7. Squishing bugs as you go at the earliest possible opportunity is much advised – multiple bugs can quickly make a system unusable and costly to fix up. (There is another similarly shaped graph – see it as The Law Of Bugterial Infection)
8. No one is perfect… not even me 😉

Feel free to use the graph above if you want and evangelise away…

chromosphere
