[09:53:40] <jinxer-wm>	 FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7006:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[09:58:40] <jinxer-wm>	 RESOLVED: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7006:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[10:08:40] <jinxer-wm>	 FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7008:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[10:13:40] <jinxer-wm>	 RESOLVED: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp7006:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:16:11] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:32:22] <effie>	 Dear traffic, I will start depooling kafka-main1002, in order to replace it with kafka-main1007
[12:34:36] <effie>	 last time things went ok, I hope it will be the same this time
[12:36:28] <effie>	 I will be out between 13:10-14:00 UTC, but it will be during the time we are copying stuff from one kafka to another 
[13:06:11] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:11:11] <jinxer-wm>	 RESOLVED: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:18:00] <jinxer-wm>	 FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:23:00] <jinxer-wm>	 RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:28:00] <jinxer-wm>	 FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:33:00] <jinxer-wm>	 RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:37:23] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10361480 (MoritzMuehlenhoff)
[14:21:00] <jinxer-wm>	 FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:26:00] <jinxer-wm>	 RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:31:30] <jinxer-wm>	 FIRING: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:36:30] <jinxer-wm>	 RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:52:21] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10361841 (JMeybohm) We're not expecting any more replacements/expansions for wikikube this FY. So we can switch to the `/17`...
[16:19:22] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru, Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362331 (Fabfur)
[16:26:26] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru, Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7010...
[16:39:00] <jinxer-wm>	 FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp6008:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:39:31] <sukhe>	 cp6008
[16:46:09] <jinxer-wm>	 FIRING: [4x] LVSHighCPU: The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[16:51:09] <jinxer-wm>	 RESOLVED: [4x] LVSHighCPU: The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:14:00] <jinxer-wm>	 FIRING: [4x] PurgedHighEventLag: High event process lag with purged on cp6008:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:17:07] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362709 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7010.magru.wmnet with OS bulls...
[17:52:03] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362844 (Fabfur) lvs7003 has been restarted after cable swap, all fine
[17:52:09] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362843 (Fabfur) Reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1098573 and ran puppet agent on `A:cp-magru`: NOOP as ex...
[18:06:55] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362873 (Fabfur) BGP flag enabled on NetBox for lvs700[1-3] and dns700[12]
[18:40:06] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363009 (Fabfur) Removed downtime from all lvs, dns and cp hosts in magru
[18:49:00] <jinxer-wm>	 FIRING: PurgedHighEventLag: High event process lag with purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[18:49:41] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363039 (Fabfur) Repooled dnsbox cluster and ran authdns-update
[18:54:00] <jinxer-wm>	 FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[18:54:00] <jinxer-wm>	 RESOLVED: PurgedHighEventLag: High event process lag with purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[18:54:32] <brett>	 ^I restarted cp7006's purged and it seems to be working through the queue
[18:54:38] <sukhe>	 ah ok great
[18:54:59] <fabfur>	 👍
[18:55:14] <brett>	 I'll restart 7009's as well
[18:55:21] <sukhe>	 ok
[18:55:51] <brett>	 done
[19:00:57] <wikibugs>	 netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10363115 (dcaro) A quick search did not find any reference for the mon option on the upstream ceph, but found a commit on a clone:  http://w...
[19:04:00] <jinxer-wm>	 RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[19:09:28] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363181 (Fabfur) ran puppet-agent on `A:magru`
[19:12:15] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363189 (Fabfur) Repooled all depooled cp hosts before repooling whole DC
[19:21:56] <wikibugs>	 Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363224 (Fabfur) Repooled magru DC
[20:18:03] <wikibugs>	 Traffic, User-notice: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10363642 (Quiddity) Hi, I believe this change probably deserves an entry in Tech News. The last similar change that I'm aware of, was announced using this wording (below).  Please cou...
[22:26:13] <wikibugs>	 Traffic, User-notice: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10363984 (BCornwall) @Quiddity There's some verbiage on https://en.wikipedia.org/sec-warning that you could use, e.g.:  > Wikimedia projects, including Wikipedia, are getting more sec...
[23:20:56] <wikibugs>	 Traffic, User-notice: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10364150 (Quiddity) Thank you! For the record (or in case edits are needed before it is frozen on Friday), I've added it to https://meta.wikimedia.org/wiki/Tech/News/2024/49 using the...