[07:52:05] Traffic, Diff-blog: Redirect techblog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T417940#12003478 (Aklapper)
[07:52:12] Traffic, Diff-blog: Redirect techblog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T417940#12003480 (Aklapper)
[07:57:25] brett: hello, there is an outstanding change on cp5018's switch
[07:57:48] is it good to push it? https://www.irccloud.com/pastebin/BSoCXpE0/
[09:14:43] Puppet is disabled on durum5003 with a note about a reimage since May 26, I'll re-enable it
[09:20:14] Traffic, Infrastructure-Foundations, Patch-For-Review: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12003715 (ops-monitoring-bot) VM prometheus5003.eqsin.wmnet switching disk type to plain
[09:27:40] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp5018:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[09:28:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp5018 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=eqsin&var-instance=cp5018 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[09:58:19] netops, Infrastructure-Foundations, Patch-For-Review: Add (some) collection for Nokia SR-Linux components - https://phabricator.wikimedia.org/T428685#12003881 (cmooney) >>! In T428685#12003236, @ayounsi wrote: >> I can't remember exactly the issue there, > It was causing the `sr_oc_mgmt_server` deamo...
[10:17:51] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12003962 (jijiki)
[10:23:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp5018 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=eqsin&var-instance=cp5018 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[10:59:19] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020#12004114 (jcrespo) Re: **backup2013**, it needs no special treatment other than downtime, it has no issue with a temporary network maintenance unless it gets extended for a few d...
[11:34:23] Traffic, Infrastructure-Foundations, Patch-For-Review: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12004243 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm
[12:46:53] Traffic, Infrastructure-Foundations, Patch-For-Review: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12004553 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5004.eqsin.wmnet with OS bookworm completed: - gan...
[13:27:55] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp5018:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[14:41:20] FIRING: DnsboxServiceMismatch: Service ntp-a state mismatch on dns1004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqiad&var-instance=dns1004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[14:42:02] yeah that's fine
[14:42:34] all good on the sync, transient alert when it was depooled for a restart
[14:46:20] RESOLVED: DnsboxServiceMismatch: Service ntp-a state mismatch on dns1004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=eqiad&var-instance=dns1004:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:03:05] FIRING: [3x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns1004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:07:50] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-b state mismatch on dns1005:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:10:41] XioNoX: Sorry for that, I didn't realize it was sitting in a queue!
[15:13:50] FIRING: [2x] DnsboxServiceMismatch: Service ntp-c state mismatch on dns1006:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:16:05] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-c state mismatch on dns1006:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:19:29] It's been pushed now
[15:20:22] thx
[15:22:20] FIRING: DnsboxServiceMismatch: Service ntp-b state mismatch on dns2005:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=codfw&var-instance=dns2005:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:27:05] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns2004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:33:50] FIRING: [3x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns2004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:37:05] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-b state mismatch on dns2005:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:42:09] Wikimedia-Apache-configuration, ServiceOps-Mediawiki: Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772#12005480 (Krinkle)
[15:43:23] Wikimedia-Apache-configuration, ServiceOps new, ServiceOps-Mediawiki: Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772#12005485 (Krinkle)
[15:43:28] Wikimedia-Apache-configuration, ServiceOps new, ServiceOps-Mediawiki, MediaWiki-Platform-Team (Radar): Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772#12005486 (Krinkle)
[15:52:50] FIRING: [2x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns3003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:57:50] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns3003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:03:20] FIRING: DnsboxServiceMismatch: Service ntp-a state mismatch on dns4003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=ulsfo&var-instance=dns4003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:08:20] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-b state mismatch on dns3004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:14:50] FIRING: [3x] DnsboxServiceMismatch: Service ntp-b state mismatch on dns3004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:19:50] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns4003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:25:05] FIRING: [3x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns4003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:26:42] netops, Infrastructure-Foundations, SRE: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#12005652 (cmooney) Open→Resolved All work on this is now complete.
[16:28:50] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-b state mismatch on dns4004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:34:50] FIRING: [2x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns5003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:39:50] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns5003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:44:50] FIRING: [3x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns5003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:49:50] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-b state mismatch on dns5004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:49:53] sigh
[16:49:57] it should certainly not spam this much
[16:51:24] it's just making up for the fewer VarnishHighThreadCount errors
[16:52:32] :)
[16:54:50] FIRING: [3x] DnsboxServiceMismatch: Service ntp-b state mismatch on dns5004:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:59:50] RESOLVED: [2x] DnsboxServiceMismatch: Service ntp-a state mismatch on dns6001:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[17:00:00] does that just fire every time the state is different than it was last time, or?
[17:01:28] that alert fires if we say that the service is pooled in confd/confctl but we are not advertising the VIP
[17:02:06] so in this case, it should not fire at all because we depool ntp.* (a,b,c) and then restart and then repool
[17:02:18] so it's a stale alert, or our check is too strict, or both
[17:03:36] though the last time we did this, we didn't get many (any?) false positives
[17:05:39] so the order of operation in theory is: depool ntp-.*, restart ntpsec (one can argue that we probably shouldn't even depool for such a restart?), repool, remove downtime
[17:05:57] except that the downtime is removed but the alert still fires
[17:07:48] > one can argue that we probably shouldn't even depool for such a restart?
[17:09:09] curious to hear what you feel but I think we can probably do away with the depool for a restart. I did this for the sync time but surely the other redundancy should kick in if it comes to that?
[17:10:57] I think so long as the rollout strategy is cautious (host at a time, some kind of verification along the way that everything's not dying as it goes), depool shouldn't be necc
[17:11:35] I mean, technically it means a few client reqs get lost, but if it's not down long, it's not going to matter to clock drift. things will just resync afterwards and be fine.
[17:11:46] ok. yeah, I am begining to wonder if it is too strict.
[17:11:47] batch_default = 1
[17:11:52] batch_max = 1
[17:11:55] # 10 minutes is probably the minimum acceptable time in between the restart
[17:11:58] # of ntpsec.service to establish some NTP sync with the public pools or the
[17:12:01] # other hosts.
[17:12:04] min_grace_sleep = 600
[17:12:13] so these safeguards should be decent I feel (they are from https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/dns/roll-restart-ntp.py)
[17:12:50] yeah I'd think it'd be fine to roll like that without BGP-level depooling
[17:12:54] and then for the verification on the DNS hosts themselves, "NTP peers and stratum check" should kick in
[17:13:20] the actual downtime in the middle is tiny anyways, just a restart
[17:13:23] yep
[17:51:11] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#12006203 (Ahoelzl) a:Ahoelzl
[18:17:22] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#12006311 (BCornwall)
[18:49:00] Wikimedia-Apache-configuration, ServiceOps new, ServiceOps-Mediawiki, MediaWiki-Platform-Team (Radar): Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772#12006469 (matmarex) I think `charset=utf-8` would be correct not just for keys.txt, but also for all .txt...
[19:49:23] Traffic, Liberica, Prod-Kubernetes, Kubernetes, ServiceOps new (Next quarter): Add missing wikikube workers to conftool-data - https://phabricator.wikimedia.org/T420729#12006716 (MLechvien-WMF)
[20:14:05] Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, SRE: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#12006862 (AKanji-WMF) @Pcoombe could you please advise as to whether this is something we should/can resolve in the ne...
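
For reference, a minimal Python sketch of the two ideas from the 17:01-17:12 discussion: the DnsboxServiceMismatch condition ("pooled in confd/confctl but not advertising the VIP") and the one-host-at-a-time restart pacing. This is not the actual alert rule or cookbook code. `ServiceState`, `is_mismatch`, `should_alert` and `restart_schedule` are hypothetical names, and the idea that an active downtime suppresses the alert is an assumption; only `batch_max = 1` and `min_grace_sleep = 600` are taken from the paste in the log.

```python
from dataclasses import dataclass

# Values quoted in the log from roll-restart-ntp.py.
BATCH_MAX = 1
MIN_GRACE_SLEEP = 600  # seconds between restarts


@dataclass
class ServiceState:
    """Hypothetical snapshot of one service (e.g. ntp-a) on one DNS host."""
    pooled: bool            # what confctl says
    advertising_vip: bool   # whether the VIP is actually being advertised
    downtimed: bool = False


def is_mismatch(state: ServiceState) -> bool:
    """Condition as described in the log: pooled, but VIP not advertised."""
    return state.pooled and not state.advertising_vip


def should_alert(state: ServiceState) -> bool:
    """Assumption: an active downtime suppresses the alert."""
    return is_mismatch(state) and not state.downtimed


def restart_schedule(hosts, batch=BATCH_MAX, grace=MIN_GRACE_SLEEP):
    """Return (start_offset_seconds, hosts_in_batch) pairs for a rolling restart."""
    batches = [hosts[i:i + batch] for i in range(0, len(hosts), batch)]
    return [(index * grace, group) for index, group in enumerate(batches)]
```

Under this model a depooled host that stops advertising is not a mismatch, so an alert during the depool/restart/repool cycle would point at stale state or at the repool landing before the VIP is advertised again, which matches the "stale alert, or our check is too strict" reading in the log.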