[07:06:17] netops, Operations: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (Joe) Looking at kafka, it seems there is a bizarre pattern in producing the data to the "netflow" topic: https://grafana.wikimedia.org/d/000000234/kafka-by-topic?panelId=34&fullscree...
[07:57:40] Traffic, netops, Operations: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi) p:Triage→Medium
[08:31:57] Traffic, netops, Operations: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[08:32:04] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[08:33:05] Traffic, Operations: Implement a prometheus exporter for rdkafka in golang - https://phabricator.wikimedia.org/T253197 (ema)
[08:35:24] Traffic, Analytics, Operations, Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (ema) Open→Resolved a:ema Closing this task now given that an initial version of `atskafka` has been created and deployed. Further improvements such as T2...
[08:53:22] Traffic, Operations, vm-requests, Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (Dzahn) ` Ready to create Ganeti VM malmok.codfw.wmnet in the ganeti01.svc.codfw.wmnet cluster on row A with 2 vCPUs, 8GB of RAM, 30GB of disk in the private network.
`
[08:56:15] <_joe_> I am having some upsetting results with pybal on lvs1015
[08:56:31] <_joe_> specifically, IdleConnection only retries on failure after ~ 40 seconds
[08:56:38] <_joe_> which seems a very long hiatus to me
[08:57:23] <_joe_> well it depends, but there is clearly a massive check lag happening
[08:57:58] <_joe_> I hope I'll help a bit when I remove all the non-https endpoints, but that will take some time
[09:05:49] _joe_: how are those 40 seconds measured?
[09:06:03] <_joe_> log lines :)
[09:06:07] <_joe_> but it varies
[09:06:14] _joe_: 40 seconds after envoy/nginx/apache2 goes away, or 40 seconds after pybal notices that the service went away?
[09:06:15] <_joe_> sometimes it's 5 seconds, sometimes it's 40
[09:06:33] <_joe_> 40 seconds between failure messages
[09:06:36] ack
[09:06:38] <_joe_> I have a server that is down
[09:06:47] <_joe_> and comes up after ~ 5 minutes
[09:07:00] <_joe_> I've noticed that checks to idleconnection happen quite sparsely
[09:07:09] <_joe_> while proxyfetch happen at a reasonable pace
[09:12:41] idleconnection doesn't "check"
[09:12:47] it maintains a connection persistently
[09:13:00] and reports when that drops, and it can't immediately reconnect
[09:13:03] so it should be very fast
[09:13:13] unless the connection reset or icmp message is not reported/doesn't arrive
[09:14:05] so do you think it's event loop lag?
[09:14:40] <_joe_> I am suspecting that, but have no proof
[09:14:52] i have some 3 year old lying around to add metrics to the event loop
[09:14:57] code
[09:15:36] <_joe_> I assume it's lag because sometimes the check is very fast
[09:15:43] <_joe_> May 20 09:04:03 lvs1015 pybal[12322]: [appservers-https_443 IdleConnection] WARN: mw1271.eqiad.wmnet (disabled/partially up/not pooled): Connection to 10.64.0.66:443 failed.
[09:15:44] haha, "code" is an important word in that sentence
[09:15:45] <_joe_> May 20 09:04:06 lvs1015 pybal[12322]: [appservers-https_443 IdleConnection] WARN: mw1271.eqiad.wmnet (disabled/partially up/not pooled): Connection to 10.64.0.66:443 failed.
[09:15:50] <_joe_> this is 3 seconds apart
[09:15:51] XioNoX: indeed ;)
[09:15:54] i also have a 3 year old lying around
[09:16:05] <_joe_> XioNoX: you thought he wanted to push the kid to production? :P
[09:16:14] hahah
[09:16:18] he has all the right characteristics for it
[09:16:33] OCD, detail oriented, know-it-all
[09:17:46] <_joe_> so for instance, picking a random server. Idleconnection first failed at 09:02:24, when the server went down
[09:19:01] <_joe_> it was checked again at 09:02:27, then at 09:02:36, then at 09:02:57 (all failing) before finally being reported up at 09:03:56
[09:19:42] it might be the exponential backoff of the retry logic
[09:19:47] <_joe_> no further failures from idleconnection in the logs
[09:19:58] <_joe_> mark: yes that could explain it
[09:20:18] looking at the code, that seems likely
[09:20:29] <_joe_> it also makes sense that there is some backoff; maybe we could cap it at some upper limit
[09:21:06] i think it already has that ;)
[09:21:13] max-delay in the config
[09:21:37] )
[09:21:38] self.maxDelay = self._getConfigInt('max-delay', self.MAX_DELAY)
[09:21:47] class IdleConnectionMonitoringProtocol(monitor.MonitoringProtocol, protocol.ReconnectingClientFactory):
[09:22:24] <_joe_> so maybe we didn't specify it
[09:22:30] nice eh, twisted!
[09:22:34] * mark ducks ;-p
[09:22:57] <_joe_> idleconnection.max-delay = 300
[09:23:01] <_joe_> maybe a bit high
[09:23:06] indeed
[09:23:18] <_joe_> 5 minutes is a long time to let a server out of rotation
[09:25:22] especially considering that we send probes from varnish-fe to ats-be every 100ms to make sure we don't miss a thing
[10:28:24] Traffic, netops, Operations: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[11:18:26] Traffic, netops, Operations, Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[12:22:26] Traffic, Operations, vm-requests, Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `malmok.codfw.wmnet` - malmok.codfw.wmnet (**FAIL**) - Failed downtime h...
[12:26:01] I'm going to restart pybal on codfw low-traffic for https://gerrit.wikimedia.org/r/c/operations/puppet/+/597485
[13:03:30] ok so now the pybal diff check complains about the old service on 10902 not being cleaned up: I'm going to remove the real servers and the virtual server with ipvsadm --delete-service 10.2.1.53:10902 (?)
[13:14:11] Traffic, Discovery, Operations, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Gehel) >>! In T243701#5985617, @Ladsgroup wrote: > I think this would be a decision by @Lydia_Pintscher...
[13:15:32] Traffic, Operations, vm-requests, Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (Dzahn) @ssingh The VM has been created (now with public IP). It has been added to site.pp with the role(insetup) and the first puppet ran that creates users and insta...
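The retry timings quoted at 09:17-09:19 (gaps of about 3 s, 9 s, 21 s and 59 s) look consistent with a capped exponential backoff. The following is a minimal sketch of that arithmetic only, not pybal's or Twisted's actual code. It assumes an initial delay of 1 second and a growth factor of e, which is what Twisted's `ReconnectingClientFactory` is understood to default to, and it assumes pybal does not override them. The function name `backoff_schedule` and its parameters are hypothetical; jitter and connection timeouts are left out, so real gaps would be somewhat longer.

```python
import math


def backoff_schedule(initial_delay=1.0, factor=math.e, max_delay=300.0, attempts=8):
    """Return the wait (in seconds) before each successive reconnect attempt.

    Hypothetical helper: the delay is multiplied by `factor` before every
    retry and clamped to `max_delay`, mirroring a capped exponential backoff.
    The default values are assumptions, see the text above.
    """
    delays = []
    delay = initial_delay
    for _ in range(attempts):
        delay = min(delay * factor, max_delay)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    # Compare the uncapped growth with the max-delay = 300 value from the log.
    for attempt, wait in enumerate(backoff_schedule(), start=1):
        print(f"retry {attempt}: wait {wait:.1f}s")
```

Under these assumptions the first four waits come out at roughly 3, 7, 20 and 55 seconds, and the sixth retry is the first to hit the 300 second cap, so a lower `max-delay` would only shorten the gaps for servers that stay down for several minutes.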
[13:15:40] Traffic, Operations, vm-requests, Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (Dzahn) Open→Resolved
[13:16:29] Traffic, Operations: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (Dzahn) A VM called malmok.wikimedia.org has been created and can be used now. Currently it has the "insetup" role in site.pp.
[13:16:56] Traffic, Operations, vm-requests, Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (Dzahn) ` root@malmok:~# gen_fingerprints +---------+---------+-----------------------------------------------------+ | Cipher | Algo | Fingerprint...
[13:31:16] documented here now: https://wikitech.wikimedia.org/wiki/PyBal#PyBal_IPVS_diff_check
[13:31:55] godog: i think it needs a restart of pybal?
[13:32:49] oh, i see the wikitech link, nevermind then
[13:32:53] mutante: depends on what's happened yeah
[13:33:29] last time Valentin helped me fix something similar when adding a service
[13:35:02] that's true on adding a new service the alert will fire too until pybal is restarted
[13:39:03] err...
[13:39:21] I wouldn't change ipvs configurations outside pybal TBH
[13:39:45] it's safer to trigger a pybal restart IMHO
[13:41:22] I did restart pybal after changing a service port and afaik the underlying ipvs service doesn't get cleaned up
[13:41:37] the now-stale service that is
[13:42:24] if that's a bug or not expected please LMK and I'll followup with a task
[13:42:33] expected
[13:42:41] pybal at that point doesn't know about it and doesn't do anything about it
[13:42:56] right
[13:43:31] yeah I seemed to remember the same, pybal doesn't touch ipvs services it doesn't know about anymore
[13:43:33] we've always had to manually do "ipvsadm -Dt 1.2.3.4:567" or whatever after the pybal restart, it doesn't have any state to know to clean those up when something is removed (or moved, similarly)
[13:44:02] it -could- just wipe out everything and start with a clean slate
[13:44:04] it does wipe out ipvs service entries on startup, to recreate them, but only for the service defs it has at startup time, as opposed to wiping everything
[13:44:07] as a simple implementation
[13:44:46] back in the old days it was a comfortable thought to easily be able to stop pybal and have the state still there while you debug pybal issues ;)
[13:44:47] but really we're wanting to move, if anything, in the other direction (where it maintains/consumes state about ipvs to where it doesn't have to pointlessly delete unchanged ones)
[13:46:11] right, IIRC I haven't removed a service since I've joined WMF :)
[13:46:15] just added new ones
[13:47:49] means we're not sunsetting enough valentin! ;)
[15:27:07] https://www.theregister.co.uk/2020/05/20/google_chrome_83/ <- DoH in Chrome 83
[15:30:36] yeah.
also in the recent days, Windows: https://www.zdnet.com/article/microsoft-adds-initial-support-for-dns-over-https-doh-in-windows-insiders/
[15:31:25] the interesting change is that Chrome now supports custom DoH providers (like Firefox), whereas earlier they would only do "upgrade" (that is, if you have configured a DoH-enabled provider, they would switch to that)
[15:35:32] Acme-chief, cloud-services-team (Kanban): tools/toolsbeta: improve acme-chief integration - https://phabricator.wikimedia.org/T252762 (aborrero) Open→Declined I'm unblocked now. Closing this task in favor of whatever we decide on {T252721}.
[16:48:07] Traffic, Operations, ops-eqsin: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (RobH)
[16:57:45] netops, DC-Ops, Operations, ops-eqsin: (Need By: TBD) rack/setup/install cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T253246 (RobH)
[16:58:01] netops, DC-Ops, Operations, ops-eqsin: (Need By: TBD) rack/setup/install cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T253246 (RobH)
[17:27:24] Acme-chief, cloud-services-team (Kanban): tools/toolsbeta: improve acme-chief integration - https://phabricator.wikimedia.org/T252762 (Krenair) we might still do this, we'll see :)
[17:32:58] Traffic, Operations, ops-eqsin: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (RobH) Ok, for memory tests we need to clear the SEL, so just dumping its output here for easy review later (it's stored in the server still but not readable without a data dump and sorting): ` admin1->...
[17:33:46] Heya traffic folks, i am investigating https://phabricator.wikimedia.org/T251219 and to do memory troubleshooting i may need to reboot and depool this
[17:33:58] is it ok to depool a single host (cp5012)?
[17:34:10] also dell will require i update firmware, etc...
[17:34:19] bblack: ^ =]
[17:35:17] robh: cp5012 is already depooled in confctl
[17:35:47] well should have looked before asking sorry about that
[17:35:50] good enough =]
[17:48:14] rebooting it, updating the bios firmware, and then running memtest
[18:21:07] Traffic, netops, Operations, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack) Update: the `nsa` authdns IP at `198.35.27.27` is live internally everywhere and monitored and working. There's some stuff to finish up later this week for the public side i...
[18:23:32] Traffic, netops, Operations, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack) (correction - it's also internet-reachable via ulsfo only for now, in this interim state, just by chance because it's still advertising the whole original /23)
[18:24:05] Traffic, Discovery, Operations, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Addshore) Indeed, currently I would only see wdqs inclusion in maxlag as a bandaid waiting for a proper...
[18:29:59] Traffic, netops, Operations, Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[18:37:54] Traffic, netops, Operations, Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[18:54:22] Traffic, Analytics, Operations: Publishing project anomaly data for censorship researchers.
Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (Nuria) a:Nuria→None
[19:12:25] Traffic, netops, Operations, Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[21:12:12] Traffic, Operations, ops-eqsin: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (RobH) a:RobH→Vgutierrez So this ran the full suite of Dell tests, including extended memory testing, without failure. I did update the firmware before testing though. @Vgutierrez Can we return...
[21:40:15] netops, Operations, ops-codfw: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (Papaul)
[21:41:02] heyas traffic folks, so cp5012 shows no errors in memtest after firmware update. Can we return it to service?
[21:41:18] it may immediately then throw a memory error, who knows
[21:41:25] but dell will want to see a failure after firmware update
[22:13:04] Traffic, Operations, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (BBlack) The `kraz` case is gone now (yay!) and hasn't recurred since the ircd restart above. What's left appears to be all infrastructure stuff: PDUs, switches, firewalls, etc. I've picked up quite...
[22:14:15] robh: yeah I can put it back in for now and see what happens
[22:14:42] err, I'll let vgutierrez do it later actually
[22:14:53] since I won't be around much today starting shortly
[22:16:13] Firmware Release Notes for version 111c.53zulu8: Bugfix: ECC multi-bit memory errors are now suppressed in order to reduce support calls
[22:18:07] cool
[23:49:16] netops, Operations, ops-eqiad: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (wiki_willy) Hi @faidon - one of the goals we have this quarter is to resolve all backlogged install tasks from q3 and earlier by end of June.
With the limited nu...