[00:38:09] Traffic, Cloud-VPS, Operations, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (bd808) >>! In T223902#5211701, @bd808 wrote: > Reading the discussion here and in irc earlier today, I think the more general topic of whic...
[07:27:17] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3035.esams.wmnet'] ` The log can be found in `...
[07:52:00] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3035.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3035.esams.wmnet'] `
[07:53:37] lovely, the reimage failed with https://phabricator.wikimedia.org/P8592
[07:56:39] will reboot manually once puppet is happy
[07:59:21] uh
[08:01:18] alright, the host came back online just fine
[08:01:22] puppet is happy
[08:01:31] \o/
[08:01:40] I'll remove the downtimes and repool
[08:13:48] vgutierrez: done with cp3035, feel free to go ahead with the ATS logging patch whenever
[08:14:02] aaaawesome, thx
[08:30:26] Traffic, Operations: ATS: traffic_layout currently forces to use its own copy of shared libraries - https://phabricator.wikimedia.org/T224428 (Vgutierrez) Open→Resolved a:Vgutierrez
[08:50:10] <_joe_> ema: why is pybal now waiting one second before reconnecting to etcd after having read a config change?
[08:50:24] <_joe_> this is preventing me from doing proper depool/repool cycles
[08:50:27] <_joe_> it frankly sucks
[08:50:48] _joe_: good morning :)
[08:50:55] I imagine to fix some silly bug?
[08:51:40] <_joe_> probably, I'm pretty sure you introduced it. But this is making a depool in batches/repool in batches impossible to do
[08:51:52] <_joe_> on a reasonably large cluster at least
[08:53:10] sorry, I've got to go afk now. Let's discuss this later!
[08:54:36] <_joe_> ok
[11:24:18] >No it’s a DNS change to the A records (where the domain looks for the website) there should be 2x A records one for the www. and one without.
[11:24:27] * Reedy stabs
[11:31:17] * Reedy suspects this guy wouldn't pass an interview with Arzhel ;)
[12:39:37] _joe_: so the problem that reconnectTimeout tries to address is https://phabricator.wikimedia.org/P6711, where pybal would not reconnect to etcd
[12:40:27] I'm totally fine to follow alternative approaches (and actually, I see that T169765 is still open, so probably the workaround does not even fully address the issue?)
[12:40:28] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765
[12:50:54] <_joe_> ema: we might pivot a bit, I'm thinking of alternative approaches.
[13:00:02] _joe_: k
[13:00:53] jbond42: I've merged https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/512495/, thanks!
[13:03:23] jbond42: I also amended https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/513596/ using my name/email to avoid unfairly attributing my grammar mistakes to you
[13:09:55] ack thanks
[13:56:41] HTTPS, Traffic, Operations: Provide acme-chief/TLS SNI list support in compile_redirects() - https://phabricator.wikimedia.org/T225096 (Vgutierrez)
[14:05:16] Traffic, Operations, ops-codfw: lvs2002: raid battery failure - https://phabricator.wikimedia.org/T213417 (Papaul) p:Normal→Low
[14:48:19] Traffic, Operations, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (ema)
[14:49:46] Traffic, Operations, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (ema) We (#traffic) have decided to continue allowing requests violating the UA policy. Instead of blocking them, we will apply stricter rate limiting...
[15:14:33] HTTPS, Traffic, Operations, Patch-For-Review: Provide acme-chief/TLS SNI list support in compile_redirects() - https://phabricator.wikimedia.org/T225096 (Vgutierrez) p:Triage→Normal
[15:33:35] netops, Operations, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (jbond) p:Triage→Normal
[15:38:58] netops, Operations, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (ayounsi) Data collection stopped after the upgrade to Routinator 0.4.0: https://grafana.wikimedia.org/d/UwUa77GZk/rpki?refresh=5m&orgId=1&from=now-7d&to=now ` ayounsi@rpki100...
[15:45:30] netops, Operations: RPKI Validation - https://phabricator.wikimedia.org/T220669 (ayounsi)
[15:45:32] netops, Operations, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (ayounsi)
[15:54:33] netops, Operations, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (fgiunchedi) Yes that looks like an error on routinator side, you can also use `promtool check rules` to see what prometheus makes of that ` prometheus1003:~$ curl -s http://r...
[16:04:28] netops, Operations, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (ayounsi) Opened https://github.com/NLnetLabs/routinator/issues/154 upstream.
[16:04:45] netops, Operations, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (ayounsi) a:ayounsi
[16:30:02] netops, Operations, observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (fgiunchedi) Unrelated to the issue at hand, but I'd also recommend prefixing metrics with `routinator_` so it is clear where they are coming from
[17:00:12] netops, Operations, ops-eqiad: upgrade mr1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (RobH) p:Triage→Normal
[17:00:22] netops, Operations, ops-eqiad: upgrade mr1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (RobH)
[17:01:09] netops, Operations, ops-codfw: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (RobH)
[17:02:34] netops, Operations, ops-eqiad: upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (RobH)
[17:03:32] netops, Operations, ops-eqiad: upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (RobH) Open→Stalled Please note @papaul is working with @ayongsi to upgrade the codfw msw1 on T224250. The current plan is to allow that to complete, and then replicate its wor...
[17:50:11] netops, Analytics, Operations: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata)
[17:55:26] netops, Analytics, Operations: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ayounsi) What are the needed network changes? The usual two are: 1/ switch port config (usually for DCops), for that we need to know which hosts...
[18:10:47] netops, Analytics, Operations: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata) > 1/ switch port config (usually for DCops), for that we need to know which hosts are going to which vlan cloudvirtan100[1-5] should be...
[19:07:06] netops, DC-Ops, Operations, observability: Send some LibreNMS alerts to dcops and netops only - https://phabricator.wikimedia.org/T224180 (RobH) so I'd just email the google group. Then the default settings for the folks in that (DC ops) is to get email updates (unless they have disabled it.)