Google’s TCP Fast Open Broken with Exim…

The Issue…

After performing an update to a server I maintain (moving from Debian 10 (Buster) to Debian 11 (Bullseye), the installed version of Exim moved from 4.92.x to 4.94.2. Along with noted changes needed to ensure that the config didn’t fall foul of the restrictions around taints, I started to notice some odd behaviour, but only when delivering mail, and large mail at that (eg with attachments – a hundred kB or so was enough), to Google MX hosts. 

Google mail delivery has never previously been an issue – the usual suspects have been Yahoo and Hotmail. However, this time I was seeing mail queue entries that hung around for hours and hours for no apparent reason: the mail address was absolutely valid, I could deliver other mail to the target, and indeed mail with attachments through other outbound SMTP servers. Eventually, delivery would work, but after a delay (sometime substantial).

The only clue I had in mainlog were entries like:

/var/log/exim4/mainlog:2022-04-27 07:47:30 1njbGQ-005LxL-M5
H=gmail-smtp-in.l.google.com [2a00:1450:4010:c0e::1a]: SMTP timeout after
sending data block (199774 bytes written): Connection timed out

/var/log/exim4/mainlog:2022-04-27 07:50:10 1njbGU-005Lz8-RV
H=gmail-smtp-in.l.google.com [74.125.131.26]: SMTP timeout after end of data
(246239 bytes written): Connection timed out

It didn’t appear ipv4/v6 made a difference (see above), and running exim in debug mode wasn’t particularly informative:

1315427 tls_write(0x55941d1d57b0, 8191)
1315427 gnutls_record_send(session=0x55941d2a3cc0, buffer=0x55941d1d57b0, left=8191)
1315427 outbytes=8191
1315427 flushing headers buffer
1315427 writing data block fd=6 size=8191 timeout=300
1315427 tls_write(0x55941d1d57b0, 8191)
1315427 gnutls_record_send(session=0x55941d2a3cc0, buffer=0x55941d1d57b0, left=8191)
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[_gnutls_io_write_flush]:727
1315427 GnuTLS<3>: ASSERT: ../../lib/record.c[_gnutls_send_tlen_int]:582
1315427 GnuTLS<2>: WRITE: -1 returned from 0x6, errno: 104
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[_gnutls_io_write_flush]:722
1315427 GnuTLS<3>: ASSERT: ../../lib/record.c[_gnutls_send_tlen_int]:589
1315427 outbytes=-110
1315427 tls_write: gnutls_record_send err
1315427 LOG: MAIN
1315427   H=aspmx.l.google.com [2a00:1450:400c:c07::1a] TLS error on connection (send): The TLS connection was non-properly terminated.
1315427 ok=0 send_quit=0 send_rset=1 continue_more=0 yield=1 first_address is NULL
1315427 tls_close(): shutting down TLS
1315427 GnuTLS<2>: WRITE: -1 returned from 0x6, errno: 32
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[errno_to_gerr]:230
1315427 GnuTLS<3>: ASSERT: ../../lib/buffers.c[_gnutls_io_write_flush]:722
1315427 GnuTLS<3>: ASSERT: ../../lib/record.c[gnutls_bye]:294
1315427   SMTP(close)>>
1315427 LOG: MAIN
1315427   H=aspmx.l.google.com [2a00:1450:400c:c07::1a]: SMTP timeout after sending data block (289658 bytes written): Connection timed out
1315427 set_process_info: 1315427 delivering 1njm9Z-005WCX-HC: just tried aspmx.l.google.com [2a00:1450:400c:c07::1a] for myusername@mydomain.com: result DEFER
1315427 added retry item for T:aspmx.l.google.com:2a00:1450:400c:c07::1a: errno=110 more_errno=0,M flags=2

A packet trace gave a slightly more interesting, but not particularly informative set of results, which indicate that there are multiple retries of a TCP segment for which the ACK has already been received from the Google server.Screenshot of Wireshark analysis of tcpdump file

If I set:

sysctl net.ipv4.tcp_window_scaling=0

(i.e. turning off TCP Window Resizing in the kernel) then things worked fine, but this has the non-trivial side-effect of majorly affecting all network performance. Setting TCP Window Scaling back to “on” landed me with the issue again. At least it was reproducible…

The Workaround…

A post by Adam Barratt suggested that this was to do with TCP Fast Open (and specifically with Google’s implementation), though the referenced blog post seemed to indicate a different set of symptoms.

Still, I gave this a whirl by adding: 

hosts_try_fastopen = !*.l.google.com

into /etc/exim4/conf.d/transports/30_exim4-config_remote_smtp . Usefully, at the time of the first test, I had a load of mailq entries for mail going to Google hosts which had previously failed delivery as above. On restart of the exim4 processes, everything cleared itself on first queue run – and I’ve not seen any issues in the logs since (have been running this for 3 weeks+ now). 

Quite why disabling TCP Fast Open  for these hosts works, I’m not sure – it looks like it might be a combination of factors in the Debian release according to Adam’s post as above, but it’s not really an Exim bug at all as it does everything correctly: chalk one up to Google doing something very odd.   EDIT: This is a kernel bug as reported at: https://lore.kernel.org/lkml/E1nZMdl-0006nG-0J@plastiekpoot/ and is fixed in the 5.18 kernel which (as of 10-Aug-2022 is available in Debian Testing), but in the interim, at least until either users are brave enough to switch to a Testing kernel, or the kernel gets released into Stable, the above workround seems appropriate. 

Either way, not an obvious solution to a very weird problem where everything “just worked” prior to upgrade, but hopefully this might help some other exim admins who are scratching their heads as well…

Forty Globs in One

Can you image 40 globular clusters at one time? Well – yes! In fact, with a wide enough field of view you can image many more, especially if you point towards such a rich collection like those found in the Andromeda Galaxy. I’d previously done so using my WO refractor, but this time wanted to go much closer in, staring at the large open cluster NGC206, located in the outer arms of M31.

NGC206, located in the Andromeda Galaxy (Luminance only)Here is an image taken through the Newtonian at 1583mm focal length which shows the region around NGC206 (in L channel only). 

The annotated image below clearly shows the location of forty globular clusters from the Bologna Catalogue (plotted using Pixinsight’s Annotate tool, importing a custom catalogue generated as an extract from the Vizier tool). Figures in brackets are the catalogue magnitude. 

NGC206, located in the Andromeda Galaxy - annotated imageTime escaped me in doing the colour for this image, so it will remain as a greyscale image for the time being. However, the mottled effect in the star clouds within M31 are in fact real – close examination of the mottling and comparison to archive DSS images reveal identical patterns in the clouds – it’s at this level we start to resolve smaller stars in the Andromeda Galaxy – even larger telescopes such as the Subaru or the Hubble Space Telescope reveal much more detail in the star clouds.

Image captured Oct/Nov 2018 from W Oxfordshire, UK.

Sadr and The Butterfly

Sadr (Gamma Cygni) sits at the centre of the large asterism known commonly as the Northern Cross that makes up the constellation of Cygnus. The region surrounding this F8 supergiant is rich in dust and hydrogen within the plane of our Milky Way as this view, taken in the light of hydrogen alpha (at 656.3nm) shows. 

Region around Sadr (Gamma Cygni) and IC1318 in Cygnus in Hydrogen AlphaNotably, just to the east (left) of Sadr, is the Butterfly Nebula, IC1318. This actually sits much further away from us than Sadr at ~5000 light years (Sadr is about 750 ly distant). The nebula appears split into two by a dark dust lane running E-W, catalogued as LDN 889. 

In the lower right of the image is the Crescent Nebula, NGC6888, powered by the Wolf-Rayet star HD 192163, while to just to its east, and only faintly visible is the small planetary nebula, PN G75.5+1.7, also known as the Soap Bubble – though really a bigger scale and [OIII] filter data would make this much more obvious!

Image taken on 17th Sept 2018 from West Oxfordshire. 

QHY163M (-15C, gain/offset=200/85), Canon 200mm 2.8L II (@f3.85), Baader 7nm HA filter, Losmandy GM8 with Gemini 1.

3h40m total exposure (5min subs) with HDR Composition for bright stars (25x30sec). 

Processing in Pixinsight. 

Image is centred at: (nova.astrometry.net)
RA: 20h20m25s
Dec: 40°20’31”

 

 

The Eagle and The Swan

The plane of the Milky Way is littered with areas of star formation; regions of the sky that glow in the light of hydrogen alpha, emitted by large clouds of excited hydrogen. Two such areas are the Eagle Nebula (M16), and the Swan or Omega Nebula (M17) as shown (north is approximately up in this image).

Eagle Nebula and Swan Nebula in Hydrogen AlphaBoth nebulae are regions of recent and ongoing star formation – and both have associated open clusters containing young, hot, luminous stars whose radiation helps to excite and illuminate the nebulae themselves.

M16 is thought to lie at a distance of approximately 7000 light years, with M17 between 5000 and 6000 light years. Also shown is the open cluster M18.

Image was taken from Domaine des Mathevies, Sainte-Nathalène, Nouvelle-Aquitaine, France on 14/15/16th Aug 2018. This is a region of sky that I can’t really image from home: the southerly declination means that it’s never particularly high above the horizon from Oxfordshire, and this pairing never makes it above the roofline of our house from my home observatory!

Image taken using a QHY163M at -15C (gain/offset 200/85), Canon 200mm f2.8L II lens with a Baader 7nm H-Alpha filter. Tracked on a Losmandy GM8.

2h50m total exposure (in 5min sub frames).

Field centred at:
RA: 18h 19m 51s
Dec: -15° 02′ 59″
Size: 3.76 x 5.12 degrees

Collinder 399 – The Coathanger

Collinder 399 - The Coathanger and nearby dark nebulaeWithin the constellation of Vulpecula is an asterism catalogued as Collinder 399 (Cr 399, from a catalogue of open clusters published by Swedish astronomer Per Collinder). More commonly, this asterism is called the Coathanger, owing to its appearance  Brocchi’s Cluster, or Al Sufi’s Cluster. 

Despite being considered as a true open cluster for much of the 20th century, this is in fact a random grouping of stars – data on parallaxes and proper motions from Hipparcos (http://articles.adsabs.harvard.edu/full/1998A%26A…340..402B) shows these stars to be a chance alignment rather than any kind of bound cluster. 

To the north (and partly encompassing the Coathanger itself) is a reflection nebula with designation LBN130, while further to the north, located with a background of a larger dark nebula (LDN 767) are another two patches of reflection nebula VdB126/LBN134 and LBN 133. Surrounding the whole area are many members of the LDN (Lynds Catalog of Dark Nebulae) catalogue that block the light from the myriad background stars in the plane of the milky way – these are marked in the associated annotated image. 

Image was taken from Sainte-Nathalène, Perigord, France on evenings of 10th/11th Aug 2018. Camera was QHY163M running at -15C on a Canon 200mm f2.8/L II lens (@f3.85).  Mount was a Losmandy GM8, guided with a 160mm guidescope and ZWO ASI120MM.

Exposure was LRGB using Baader 36mm LRGB filters. Exposures were L=166 (1min sub-exposures); RGB=98:90:88 (2min subs) – totalling 7h22m. All taken at gain 75, offset 27. Image capture using Sequence Generator Pro; processing using Pixinsight. 

Field centred at:
RA: 19h 25m 37s
Dec: +21° 42′ 27″
Up is 358 degrees E of N, field size: 3.98 x 5.3 deg

This image was published in the Gallery section of Astronomy Now, October 2018.