[09:53:40] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7006:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[09:58:40] RESOLVED: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7006:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[10:08:40] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7008:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[10:13:40] RESOLVED: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp7006:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:16:11] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:32:22] Dear traffic, I will start depooling kafka-main1002, in order to replace it with kafka-main1007
[12:34:36] last time things went ok, I hope it will be the same this time
[12:36:28] I will be out between 13:10-14:00 UTC, but it will be during the time we are copying stuff from one kafka to another
[13:06:11] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:11:11] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:18:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:23:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:28:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:33:00] RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[13:37:23] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10361480 (MoritzMuehlenhoff)
[14:21:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:26:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:31:30] FIRING: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:36:30] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:52:21] netops, Infrastructure-Foundations, Prod-Kubernetes, serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10361841 (JMeybohm) We're not expecting any more replacements/expansions for wikikube this FY. So we can switch to the `/17`...
[16:19:22] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru, Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362331 (Fabfur)
[16:26:26] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru, Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7010...
[16:39:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp6008:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:39:31] cp6008
[16:46:09] FIRING: [4x] LVSHighCPU: The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[16:51:09] RESOLVED: [4x] LVSHighCPU: The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:14:00] FIRING: [4x] PurgedHighEventLag: High event process lag with purged on cp6008:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:17:07] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362709 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7010.magru.wmnet with OS bulls...
[17:52:03] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362844 (Fabfur) lvs7003 has been restarted after cable swap, all fine
[17:52:09] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362843 (Fabfur) Reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1098573 and ran puppet agent on `A:cp-magru`: NOOP as ex...
[18:06:55] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362873 (Fabfur) BGP flag enabled on NetBox for lvs700[1-3] and dns700[12]
[18:40:06] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363009 (Fabfur) Removed downtime from all lvs, dns and cp hosts in magru
[18:49:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[18:49:41] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363039 (Fabfur) Repooled dnsbox cluster and run authdns-update
[18:54:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[18:54:00] RESOLVED: PurgedHighEventLag: High event process lag with purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[18:54:32] ^I restarted cp7006's purged and it seems to be working through the queue
[18:54:38] ah ok great
[18:54:59] 👍
[18:55:14] I'll restart 7009's as well
[18:55:21] ok
[18:55:51] done
[19:00:57] netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10363115 (dcaro) A quick search did not find any reference for the mon option on the upstream ceph, but found a commit on a clone: http://w...
[19:04:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp7006:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7006 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[19:09:28] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363181 (Fabfur) ran puppet-agent on `A:magru`
[19:12:15] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363189 (Fabfur) Repooled all depooled cp hosts before repooling whole DC
[19:21:56] Traffic, DC-Ops, Infrastructure-Foundations, ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363224 (Fabfur) Repooled magru DC
[20:18:03] Traffic, User-notice: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10363642 (Quiddity) Hi, I believe this change probably deserves an entry in Tech News. The last similar change that I'm aware of, was announced using this wording (below). Please cou...
[22:26:13] Traffic, User-notice: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10363984 (BCornwall) @Quiddity There's some verbiage on https://en.wikipedia.org/sec-warning that you could use, e.g.: > Wikimedia projects, including Wikipedia, are getting more sec...
[23:20:56] Traffic, User-notice: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10364150 (Quiddity) Thank you! For the record (or in case edits are needed before it is frozen on Friday), I've added it to https://meta.wikimedia.org/wiki/Tech/News/2024/49 using the...