[09:46:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:40:27] topranks: o/ https://phabricator.wikimedia.org/T420223#11752904 - veeery weird
[10:40:48] does it ring a bell?
[10:52:05] elukey: just afk right now, will check in a short while
[10:52:27] np! Even tomorrow, nothing urgent
[11:30:02] topranks: I've read the reply, makes sense, thanks. What I am wondering though is the following: those worst-case latency timings align with the frequency of the timeouts we are seeing between eqiad and codfw for mcrouter, so not necessarily a sign that the network is performing poorly in the general use case, but that for latency-sensitive apps like mcrouter even outliers could count.
[11:30:45] I am also seeing weird values to mc1041 from wikikube-worker1070
[11:31:47] is there a way to test whether those bumps in latency are "real" or artificial? Like the router taking time to generate TTL exceeded etc..
[11:38:11] it's unrealistic on the network that we'll never have some jumps in RTT (say buffers get full due to some burst)
[11:38:45] I think we need to engineer the apps to perform well even if we occasionally have higher RTT, rather than try to fix it at the network layer
[11:39:00] there are fixes - look at the high-frequency trading world - but I'm not sure that's the way to approach this
[11:40:29] elukey: it's hard to assess exactly what is causing the higher RTT.
In theory if we have pcaps on either side we can look at tx vs rx times for packets, and work out the one-way delay.
[11:40:48] but what might be hard is getting that down to ms accuracy, given there will be some drift between the clocks on the systems on either side anyway
[11:41:55] it absolutely could be due to buffering on the network when we have bursts though, so let's not assume it's only cosmetic due to ICMP generation
[11:46:44] I added some other mtr reports and it seems something is also happening within eqiad
[11:47:12] topranks: you are totally right about the RTT expectations, but I am puzzled why this happens only on a subset of nodes
[11:47:32] do you think it may depend on the other hosts in the rack and what they do?
[11:49:42] elukey: the problem is finance will never sign off on the tens of millions you want to fix the jitter
[11:50:13] one thing to bear in mind is that anything in eqiad row a/b is going to have packet drops and high jitter
[11:50:36] this is due to those rows being connected at 10G to older (Trident 2) switches with low buffer memory
[11:50:49] I'm not sure if that correlates with the hosts that have worse jitter
[11:51:09] the opposite :D hosts with weird jitter are in D/C
[11:51:12] the other thing I'd wonder is whether these are connected at 10g/1g or a mix?
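[Editor's note] The pcap idea above can be sketched roughly as follows. With per-packet tx/rx timestamps from captures on both hosts, the unknown clock offset is approximately constant over a short capture window, so subtracting the minimum observed delay isolates the queuing/jitter component even without synchronized clocks. This is a hedged illustration; the function name and all numbers are made up.

```python
# Estimate one-way delay *jitter* from matched tx/rx timestamps
# (e.g. extracted from pcaps on both hosts). Absolute OWD is unknowable
# without synchronized clocks, but the clock offset is roughly constant
# over a short capture, so subtracting the minimum observed delay
# leaves only the variable (queuing) component.
def owd_jitter_ms(tx_ts, rx_ts):
    """tx_ts/rx_ts: per-packet timestamps in seconds, matched in order."""
    raw = [rx - tx for tx, rx in zip(tx_ts, rx_ts)]
    base = min(raw)  # absorbs the unknown clock offset + minimum path delay
    return [(d - base) * 1000.0 for d in raw]

# Synthetic example: a constant ~5s clock offset between the two hosts,
# plus one packet that queued an extra 200 ms somewhere.
tx = [0.000, 1.000, 2.000]
rx = [5.010, 6.210, 7.010]
print(owd_jitter_ms(tx, rx))  # roughly [0.0, 200.0, 0.0]
```

This only recovers jitter, not absolute delay, which matches the concern above about clock drift making millisecond-level absolute accuracy hard.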
[11:51:19] well then let's just blame Nokia
[11:51:22] :P
[11:52:16] what we probably need is to get two sretest hosts in those same racks
[11:52:22] ah wow interesting, some nodes are 1G afaics, I never tried ifstat to see what happens to them
[11:52:37] and try to do some more production-grade RTT tests somehow, I'll have to look into how we might do that
[11:53:12] shouldn't matter too much 1g/10g, but if a burst of packets arrives at 10g it is cleared 10 times faster, thus less time queuing for the last of those packets, thus less jitter
[11:53:16] both good and bad workers have 1g afaics
[11:53:24] 800ms is something insane - that *has* to be delay in one of the hosts
[11:53:29] the network simply won't buffer anything that long
[11:53:52] the 200ms values in the other one could be the network
[11:53:54] Effie had a good idea to cordon some of the nodes that show high jitter from k8s, to see how the errors go
[11:54:11] yeah that's definitely worth doing +1
[12:03:40] elukey: I'm gonna take a packet capture on wikikube-worker1070 for traffic to that host to see if I can spot anything
[12:08:20] elukey: which hosts are 1g?
[12:10:52] re: ifstat, we're pretty consistently making the rx pretty hot on a bunch of the NICs https://grafana.wikimedia.org/goto/bfh6hx31lt3i8e?orgId=1
[12:11:37] in some cases quite hot https://grafana.wikimedia.org/goto/dfh6hzzav8ni8e?orgId=1
[12:20:28] cdanis: the biggest mystery is that this all started sometime around mid-December, for no apparent reason
[12:21:21] to make things worse there are some expected errors which are just muddying the waters
[12:24:39] those are https://phabricator.wikimedia.org/T374366 (firewall rules and puppet runs) plus the occasional OOM kills
[12:32:10] cdanis: that is an interesting metric! nic_saturation_hot_seconds_total ??
[12:32:54] what is it based on / how is it exported?
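[Editor's note] Some back-of-the-envelope arithmetic supports both the "cleared 10 times faster" point and the claim at 11:53:24 that an 800 ms spike can't be network buffering. The burst size below is a made-up example.

```python
# Queuing delay seen by the last packet of a burst draining out of a
# switch port: time = bits / line rate.
def drain_ms(burst_bytes: int, link_bps: float) -> float:
    return burst_bytes * 8 / link_bps * 1000

burst = 1_000_000  # hypothetical 1 MB burst
print(drain_ms(burst, 1e9))   # ~8.0 ms behind a 1G port
print(drain_ms(burst, 10e9))  # ~0.8 ms behind a 10G port, 10x faster

# Conversely, to buffer 800 ms at even 1G a switch would need to queue
# ~100 MB for one flow - far beyond typical switch buffer memory, hence
# an 800 ms RTT spike almost certainly comes from a host, not the network.
print(800e-3 * 1e9 / 8 / 1e6, "MB")
```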
[12:38:02] topranks: so years ago we had memcached trouble because of microbursts of NIC saturation, and elukey set up a shell one-liner that ran like `ifstat 1 | awk` to count how many seconds/second the NIC was >= 80% utilization in either direction
[12:39:24] it worked like a charm so https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/prometheus/files/usr/local/bin/prometheus-nic-saturation-exporter.py
[12:39:30] it's a python script now that's installed on the fleet
[12:40:05] right now "hot" is >=90% and "warm" is >=80%
[12:40:45] 10netbox, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11753417 (10Jclark-ctr) Unable to wrap my head around the mistake this early today. Likely a typo in the wikikubes...
[12:43:48] cdanis: wow that's amazing :)
[12:44:35] # TODO: when we finally kill all Jessie hosts, type-annotate the return value as Iterable[str].
[12:44:37] lol
[12:51:02] if anything, things look better since december https://grafana.wikimedia.org/goto/cfh6liazy8iyof?orgId=1
[13:15:12] We had some problems with 1G ingress on dse-k8s hosts. We solved it by getting rid of the 1G hosts, but if that's not practical y'all might look into RPS, assuming it's not already active on the wikikube workers
[13:26:51] 10netbox, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11753681 (10Jclark-ctr) I also see a typo on wikikube-worker1371 Mac address in Netbox 90:5A:08:7B:1F:6D (netbox)...
[13:27:19] topranks: do I have the chance to borrow you for a minute or 99 for some strange networking issues I'm trying to make sense of?
[13:27:44] jayme: by all means
[13:28:41] sweet
[13:35:27] back sorry!
[13:35:39] cdanis: I recall ifstat but I always forget the exporter!
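[Editor's note] The counting described at 12:38-12:40 can be sketched as below. This is an illustrative reimplementation of the idea only, not the actual prometheus-nic-saturation-exporter.py; the function name and sample numbers are made up, and treating "hot" and "warm" as mutually exclusive buckets is a guess.

```python
# Count one-second samples where a NIC is "hot" (>=90% of line rate) or
# "warm" (>=80%), in either direction - the same logic the original
# `ifstat 1 | awk` one-liner applied with a single >=80% threshold.
def saturation_seconds(samples_bps, link_bps):
    """samples_bps: iterable of (rx_bps, tx_bps) one-second samples."""
    hot = warm = 0
    for rx_bps, tx_bps in samples_bps:
        util = max(rx_bps, tx_bps) / link_bps  # either direction counts
        if util >= 0.90:
            hot += 1
        elif util >= 0.80:
            warm += 1
    return hot, warm

# Three seconds on a 1G NIC: one hot (95% rx), one warm (85% rx), one idle
samples = [(9.5e8, 1e8), (8.5e8, 1e8), (2e8, 1e8)]
print(saturation_seconds(samples, 1e9))  # -> (1, 1)
```

The attraction of counting threshold-crossing seconds rather than averaging rates is that microbursts survive: a 1-second spike to 95% shows up as a hot second even if the 5-minute average looks tame.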
:D
[13:36:38] effie: can we depool more wikikube workers? I know you removed the first two, but I think we'd need a little more to see something noticeable
[13:40:50] 10netbox, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11753761 (10Jclark-ctr) Thanks for checking @Volans I still do not see any issues with backup1012 only issues when...
[13:41:25] elukey: sure, I am checking with the folks
[13:58:54] elukey: I depooled 1070, we have another ongoing mystery so we will leave it at that for now
[14:07:43] effie: okok, I made a list of the top pods yielding TKOs in https://phabricator.wikimedia.org/T420223#11753854
[14:11:07] 10netops, 06Infrastructure-Foundations: mr1-eqiad: move from OSPF to BGP - https://phabricator.wikimedia.org/T421238#11753882 (10Papaul) @ayounsi please see below for the BGP config to setup BGP and remove OSPF between the mr router and the core routers. I will send out a gerrit patch later today and merge it wh...
[14:11:20] tx
[15:59:22] 10Mail, 06collaboration-services, 06Infrastructure-Foundations, 06SRE, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11754661 (10ABran-WMF) I've been able to test my change on [[ https://wikitech.wikimedia.org/wiki/Puppet/Pontoon | Pontoon ]]:...
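[Editor's note] On the RPS suggestion made at 13:15: Receive Packet Steering spreads receive-side protocol processing across CPUs by writing a hex CPU bitmask into sysfs, which can help when a single core saturates handling a 1G NIC's softirqs. A hedged sketch; the interface name and queue are placeholders and the mask should match the host's actual topology.

```shell
# Build a hex CPU mask covering the first N CPUs (e.g. N=4 -> "f")
NCPUS=4
MASK=$(printf '%x' $(( (1 << NCPUS) - 1 )))
echo "$MASK"

# Then, as root on the worker (eth0/rx-0 are placeholder names):
#   echo "$MASK" > /sys/class/net/eth0/queues/rx-0/rps_cpus
# A mask of 0 (the default on many systems) means RPS is disabled.
```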
[18:54:25] FIRING: [3x] SystemdUnitFailed: nic-saturation-exporter.service on ganeti2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:56:20] aw man
[18:57:17] oh
[19:04:25] FIRING: [3x] SystemdUnitFailed: nic-saturation-exporter.service on ganeti2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:24:25] RESOLVED: [2x] SystemdUnitFailed: nic-saturation-exporter.service on ganeti4008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:24] Hello! Traffic has been using sre.hosts.downtime and then sre.hosts.reboot-single for servicing LVS reboots. However, we've been noticing that our manual downtimes (i.e. all downtimes) are removed once the host has been rebooted. Is this known/desired behavior?
[22:55:25] It's difficult for us as we intentionally keep some services (e.g. pybal.service) stopped even after reboot, which normally fires alarms
[22:56:21] Yes, it's odd/perhaps inappropriate for services to be dictating availability like that, but that's how it be :/