[09:25:08] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema) Thanks to netconsole (T242579) we finally managed to get the kernel oops of two upload@esams crashes. cp3051 crashing: ` Jan 26 21:20:27 ganeti3002 nc.openbsd[14771]: [3097828.536600] ------...
[09:41:05] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema) Source code taken from linux-source-4.9 4.9.189-3+deb9u2, the crash is at net/core/skbuff.c:1212 (see ema@boron.eqiad.wmnet:~/linux-source-4.9): ` 1185 /** 1186 * pskb_expand_head - realloca...
[09:46:55] I saw some references to that codebase on newer kernel patches, time to try buster?
[09:48:33] we've seen those crashes also with backup2001, though (which was running 4.19)
[09:49:02] ofc we don't have netconsole there, so they could simply be similar, but not the same as on the esams caches
[09:49:08] I know, although someone pointed out on the ticket that they are not 100% sure they are related
[09:49:16] have we ever performed extensive memory checks on the failing hosts?
[09:49:22] e.g. cp3* ones seem more frequent
[09:50:34] ema: I don't think so
[10:11:06] jynus: yeah we have started with buster upgrades in ulsfo -- I guess we could proceed with upload@esams next, if things don't burn in ulsfo
[10:11:48] moritzm: what's the recommended way to go these days, still memtest86?
[10:14:35] ema: maybe split into upgraded ones and ones not upgraded, and do a test to see if things improve -- nothing will be wasted if it doesn't work
[10:20:09] ema: memtest86 is still the canonical FLOSS tool, but I'm pretty sure that if we want to test DIMM modules in a way that we can eventually have them swapped under warranty, we'll need to run some Dell tool, best to ping DC ops
[10:22:21] unrelated but JFYI I'm going ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/563977
[10:24:33] looking at the OOPS I don't think we're running into memory issues (and if there were any, they'd be corrected by ECC), it rather looks like a hardware issue of some sort (not necessarily that the hw is physically broken, I still believe that having BIOS/NIC/ILO firmware upgraded is the most promising way to resolve this)
[10:27:42] moritzm: it's maybe worth trying https://github.com/torvalds/linux/commit/633547973ffc32fd2c815639d4675e1531f0896f too
[10:29:36] I think that's unrelated to the crash, if there were an issue with atomic_t we'd see issues in many other places, the migration to refcount_t is some hardening to render some counter overflows non-exploitable
[10:29:59] with refcount_t it still crashes, only in a way that prevents a use-after-free with code execution possibilities
[10:30:37] ack
[10:30:54] see https://lkml.org/lkml/2017/4/21/757 for some background
[10:53:38] Traffic, Operations, Wikimedia-Logstash, observability, and 2 others: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (fgiunchedi) Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found: 1. jo...
[11:20:20] speaking about crashes...
[11:23:44] oh, it may be actually unrelated (network), as I see cp3057 through serial console
[11:25:36] not sure how to proceed, though
[11:31:54] Traffic, Operations, ops-esams: cp3057 network down - https://phabricator.wikimedia.org/T244127 (jcrespo)
[11:32:42] Traffic, netops, Operations, ops-esams: cp3057 network down - https://phabricator.wikimedia.org/T244127 (jcrespo)
[11:37:06] jynus: mmh. I could login as root via console but now the system is entirely unresponsive
[11:37:10] rebooting
[11:37:17] oh
[11:37:34] so it may be a real crash, it looked responsive to me at first
[11:37:45] maybe it had only soft-crashed by then
[11:38:10] feel free to decline the task
[11:39:24] to be fair, I didn't do a full check, I just saw the login process up and working normally at that time
[11:39:44] so I didn't want to do a restart
[11:42:16] cool then, I was wrong then
[11:42:40] for some meaning of "cool" 0:-)
[11:42:44] :)
[11:43:10] well I could also login and the host only stopped working after a little while
[11:44:34] I'm not sure if this is always the case when we see hosts freezing (ie: host becomes unreachable via network but other things still work for a bit)
[11:45:05] cannot say, first time I catch one "real time"
[11:50:42] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema)
[11:50:55] Traffic, netops, Operations, ops-esams: cp3057 network down - https://phabricator.wikimedia.org/T244127 (ema) p:Triage→Normal
[11:55:33] Traffic, netops, Operations, ops-esams: cp3057 network down - https://phabricator.wikimedia.org/T244127 (ema) The host went down at 11:17 according to icinga, and the following warning was reported a little earlier to netconsole. Unfortunately, we currently cannot tell which host sent which messa...
[12:00:57] Traffic, netops, Operations, ops-esams: cp3057 network down - https://phabricator.wikimedia.org/T244127 (jcrespo) +1, there were icinga errors as early as 11:15: ` [2020-02-03 11:15:57] SERVICE ALERT: cp3057;Webrequests Varnishkafka log producer;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket ti...
[12:12:37] ema, jynus, not clear from the task, is that cp host issue a network one?
[12:13:02] XioNoX: I thought so, but most likely general crashing issue
[12:13:10] *I thought so before
[12:14:09] as at first it was up and responsive but unreachable by ping/ssh
[12:14:10] great, one less email in my inbox :)
[12:14:16] sorry for the noise
[12:14:47] Traffic, Operations, ops-esams: cp3057 network down - https://phabricator.wikimedia.org/T244127 (ayounsi)
[12:14:49] no pb at all!
[12:14:59] I did see a few connectivity questions, however; please help us with those, as they are not something I was able to handle
[12:15:04] at noc@
[12:17:39] Traffic, Operations, ops-esams: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (jcrespo)
[12:18:01] yep, already replied to one, working on the others
[12:18:13] thank you!
[12:19:08] Traffic, Operations, ops-esams: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (jcrespo)
[14:18:00] cp4031 is an hour away from a fds kaboom, i'm restarting it
[14:30:29] cdanis: thanks
[14:31:18] https://grafana.wikimedia.org/d/OU_pxz8Wz/cdanis-ulsfo-vcache-open-fds?orgId=1&from=now-6h&to=now
[14:33:50] sorry for the bikeshedding ;)
[14:33:55] nah, you're right
[14:33:58] pcc is sad though
[14:34:00] Error while evaluating a Function Call, $metric should begin with 'node_' then lowercase chars or _ and end with '_total' but is [varnish_filedescriptors_total]
[14:34:15] renaming to 'node_varnish_filedescriptors_total'
[15:33:43] netops, Operations: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (ayounsi) Now that the issue is on the cr1-eqiad to cr3-knams link, I'm going to push the following: `lang=diff,name=cr3-knams [edit system syslog] file messages { ... } + fi...
[16:12:21] cdanis: thanks for updating the dashboard!
[16:14:12] 👍
[20:46:43] Traffic, Operations, Inuka-Team (Kanban), MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (nshahquinn-wmf) Thanks for testing on Beta cluster, @SBisson! I see two server log entries here so...
[20:56:03] Traffic, Operations, Inuka-Team (Kanban), MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (nshahquinn-wmf)
[23:36:33] Traffic, Operations, ops-eqsin: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250 (RobH) Please note this has been confirmed as likely to occur on Feb 6th (GMT). Jin has approved that he can work during that window, and we need to get confirmation from @bblack that t...
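Note on the pcc error quoted at 14:34:00: the rule it describes (begin with 'node_', then lowercase chars or _, end with '_total') can be approximated with a regular expression. This is only a sketch reconstructed from the wording of the error message; the actual check in the puppet code is not shown in this log and may differ. The pattern and the function name `is_valid_metric_name` are assumptions made for illustration.

```python
import re

# Assumed pattern, reconstructed from the error text only:
# 'node_' + one or more lowercase letters/underscores + '_total'.
# The real puppet validation is not shown in the log and may be stricter or looser.
METRIC_NAME_RE = re.compile(r"node_[a-z_]+_total")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the assumed naming rule."""
    return METRIC_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    # The two names that appear in the log.
    for candidate in ("varnish_filedescriptors_total",
                      "node_varnish_filedescriptors_total"):
        print(candidate, is_valid_metric_name(candidate))
```

Under this assumed pattern the original name is rejected because it lacks the 'node_' prefix, and the renamed metric from 14:34:15 is accepted.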