[00:04:21] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1012.eqiad.wmnet with OS bookworm
[00:04:29] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10212901 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm execu...
[00:10:48] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1078765 (owner: TrainBranchBot)
[00:15:32] <wikibugs>	 (CR) ZhaoFJx: "Thanks for letting me know!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: ZhaoFJx)
[00:19:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212902 (phaultfinder)
[00:43:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:57:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:59:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212920 (phaultfinder)
[01:14:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:16:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:25:15] <wikibugs>	 (CR) Hamish: Configure ContactPage and IPBE contact form on zhwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1072876 (https://phabricator.wikimedia.org/T359998) (owner: Hamish)
[01:29:18] <wikibugs>	 (CR) Hamish: [C:+1] zhwiki: Allow event-organizer self remove usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: ZhaoFJx)
[01:36:15] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:41:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:49:59] <wikibugs>	 (PS1) Hamish: zhwiki: Revise contact page field usage [mediawiki-config] - https://gerrit.wikimedia.org/r/1078773
[01:59:43] <wikibugs>	 (PS1) Albertoleoncio: [brwikimedia] Enable the CampaignEvents extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747)
[02:04:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212935 (phaultfinder)
[02:08:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: Albertoleoncio)
[02:14:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:24:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:36:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:41:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:44:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212956 (phaultfinder)
[02:51:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:52:00] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:01:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212964 (phaultfinder)
[03:10:56] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212965 (Papaul)
[03:37:59] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:09:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2056.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:11:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[04:11:44] <jinxer-wm>	 Deployment cfssl-issuer in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=cert-manager&var-deployment=cfssl-issuer - ...
[04:11:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:14:33] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: kubernetes2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:16:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:19:33] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: wikikube-worker2041.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:19:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212990 (phaultfinder)
[04:29:33] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: wikikube-worker2059.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2059.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:33:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: mw2447.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2447.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:38:33] <jinxer-wm>	 RESOLVED: [3x] KubernetesCalicoDown: mw2337.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:41:44] <jinxer-wm>	 FIRING: [3x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:43:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2074.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2074.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:48:33] <jinxer-wm>	 FIRING: [5x] KubernetesCalicoDown: mw2437.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:53:33] <jinxer-wm>	 FIRING: [9x] KubernetesCalicoDown: mw2310.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:57:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:58:33] <jinxer-wm>	 FIRING: [13x] KubernetesCalicoDown: kubernetes2038.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:01:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[05:03:33] <jinxer-wm>	 FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:04:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:05:04] <_joe_>	 I am at the gym
[05:05:32] <_joe_>	 I can’t get home in less than 30 minutes
[05:06:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[05:08:33] <jinxer-wm>	 FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:09:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:11:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[05:12:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:12:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:12:31] <swfrench-wmf>	 _joe_: just came in to shut down my computer for the night and now reading through back-scroll ... I'll start looking at the KubernetesCalicoDown alerts as a starting point, as I assume they're the source of this
[05:12:41] <swfrench-wmf>	 moving to -sre with lower noise
[05:13:33] <jinxer-wm>	 FIRING: [20x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:14:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10213049 (phaultfinder)
[05:17:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:17:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:17:37] <jinxer-wm>	 FIRING: ProbeDown: Service miscweb2003:30443 has failed probes (http_static_codereview_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:17:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10213053 (VRiley-WMF) Open→Resolved Thanks! I'll resolve this for now. Feel free to reopen if the issue crops up again.
[05:18:33] <jinxer-wm>	 FIRING: [23x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:19:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:20:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:20:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:22:37] <jinxer-wm>	 RESOLVED: ProbeDown: Service miscweb2003:30443 has failed probes (http_static_codereview_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:23:33] <jinxer-wm>	 FIRING: [37x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:23:33] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[05:24:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:25:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:25:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:28:33] <jinxer-wm>	 FIRING: [40x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:28:33] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[05:28:37] <jinxer-wm>	 FIRING: ProbeDown: Service miscweb2003:30443 has failed probes (http_static_codereview_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:28:48] <jinxer-wm>	 FIRING: [40x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:29:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:30:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:33:33] <jinxer-wm>	 FIRING: [53x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:33:37] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:33:48] <jinxer-wm>	 FIRING: [53x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:34:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:34:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:35:56] <jinxer-wm>	 FIRING: WcqsStreamingUpdaterFlinkJobNotRunning: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[05:36:02] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:38:33] <jinxer-wm>	 FIRING: [69x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:38:52] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:39:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:39:15] <jinxer-wm>	 RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:39:53] <akosiaris>	 ο/
[05:40:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[05:40:51] <jinxer-wm>	 FIRING: [5x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:40:56] <jinxer-wm>	 RESOLVED: WcqsStreamingUpdaterFlinkJobNotRunning: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[05:41:02] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:41:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:41:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[05:42:31] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:42:43] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[05:43:33] <jinxer-wm>	 FIRING: [85x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:43:48] <jinxer-wm>	 FIRING: [85x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:43:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:43:57] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:44:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:44:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:45:51] <jinxer-wm>	 FIRING: [9x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:45:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service miscweb2003:30443 has failed probes (http_research_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:46:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:46:15] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:46:15] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:48:04] <jinxer-wm>	 FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[05:48:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:48:33] <jinxer-wm>	 FIRING: [114x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:48:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[05:48:48] <jinxer-wm>	 FIRING: [114x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:48:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:49:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=kartotherian.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:49:56] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[05:50:38] <akosiaris>	 !incidents
[05:50:38] <sirenbot>	 5300 (ACKED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[05:50:38] <sirenbot>	 5302 (ACKED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[05:50:39] <sirenbot>	 5303 (ACKED)  ProbeDown sre (ip4 probes/service codfw)
[05:50:39] <sirenbot>	 5304 (UNACKED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[05:50:39] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[05:50:39] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[05:50:50] <akosiaris>	 !ack 5304
[05:50:50] <sirenbot>	 5304 (ACKED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[05:50:51] <jinxer-wm>	 FIRING: [9x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:50:57] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service miscweb2003:30443 has failed probes (http_design_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:51:11] <akosiaris>	 I have no idea why kartotherian is complaining, but higher priority stuff exists right now
[05:51:15] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:51:15] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:51:55] <jinxer-wm>	 FIRING: [3x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:53:20] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:53:26] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:53:33] <jinxer-wm>	 FIRING: [132x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:54:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 7.831% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:54:45] <jinxer-wm>	 FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[05:55:39] <jinxer-wm>	 FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ...
[05:55:39] <jinxer-wm>	 The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - ...
[05:55:45] <jinxer-wm>	 https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqcodfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesI
[05:56:00] <jinxer-wm>	 FIRING: [7x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:56:05] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:56:15] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:56:19] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service cxserver:4002 has failed probes (http_cxserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:56:33] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[05:56:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[05:56:55] <jinxer-wm>	 RESOLVED: [3x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:57:59] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:58:33] <jinxer-wm>	 FIRING: [147x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:59:10] <wikibugs>	 (03PS1) 10Jelto: admin: bump calico resources in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078791
[05:59:15] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 24.46% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:59:27] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:59:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] admin: bump calico resources in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078791 (owner: 10Jelto)
[05:59:45] <jinxer-wm>	 RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[06:00:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] admin: bump calico resources in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078791 (owner: 10Jelto)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T0600)
[06:00:49] <wikibugs>	 (03CR) 10Jelto: [V:03+2] admin: bump calico resources in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078791 (owner: 10Jelto)
[06:00:51] <jinxer-wm>	 FIRING: [8x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:00:52] <wikibugs>	 (03CR) 10Jelto: [V:03+2 C:03+2] admin: bump calico resources in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078791 (owner: 10Jelto)
[06:00:57] <jinxer-wm>	 RESOLVED: [11x] ProbeDown: Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:01:15] <jinxer-wm>	 FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:01:15] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service cxserver:4002 has failed probes (http_cxserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:01:33] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[06:01:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[06:01:55] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[06:03:01] <akosiaris>	 !incidents
[06:03:01] <sirenbot>	 5300 (ACKED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[06:03:02] <sirenbot>	 5302 (ACKED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[06:03:02] <sirenbot>	 5304 (ACKED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[06:03:02] <sirenbot>	 5306 (UNACKED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[06:03:02] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:03:03] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[06:03:03] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[06:03:03] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:03:07] <akosiaris>	 !ack 5306
[06:03:07] <sirenbot>	 5306 (ACKED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[06:03:10] <akosiaris>	 !incidents
[06:03:10] <sirenbot>	 5300 (ACKED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[06:03:10] <sirenbot>	 5302 (ACKED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[06:03:10] <sirenbot>	 5304 (ACKED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[06:03:11] <sirenbot>	 5306 (ACKED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[06:03:11] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:03:11] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[06:03:11] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[06:03:12] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:03:50] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service miscweb2003:30443 has failed probes (http_transparency_archive_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:21] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:04:26] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 1.974% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:04:30] <jinxer-wm>	 FIRING: [145x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:04:36] <jinxer-wm>	 FIRING: [145x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:04:45] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:04:57] <jinxer-wm>	 FIRING: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[06:05:02] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:06] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 16.51% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:05:17] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:05:33] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[06:05:45] <jinxer-wm>	 FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[06:05:51] <jinxer-wm>	 FIRING: [10x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:06:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:06:12] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service miscweb2003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:06:13] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[06:06:15] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:06:19] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:06:45] <jinxer-wm>	 FIRING: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[06:07:27] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:07:57] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:07:59] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:08:15] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 24.3% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:08:33] <jinxer-wm>	 FIRING: [157x] KubernetesCalicoDown: kubernetes2011.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:08:36] <akosiaris>	 !incidents
[06:08:37] <sirenbot>	 5300 (ACKED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[06:08:37] <sirenbot>	 5302 (ACKED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[06:08:37] <sirenbot>	 5304 (ACKED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[06:08:38] <sirenbot>	 5306 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[06:08:38] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:08:38] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[06:08:38] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[06:08:39] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:08:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:08:52] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[06:08:56] <jinxer-wm>	 RESOLVED: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:51] <jinxer-wm>	 RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:09:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:02] <jelto>	 !incidents
[06:10:02] <sirenbot>	 5300 (ACKED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[06:10:03] <sirenbot>	 5302 (ACKED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[06:10:03] <sirenbot>	 5304 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[06:10:03] <sirenbot>	 5306 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[06:10:03] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:10:03] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[06:10:04] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[06:10:04] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:10:33] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[06:10:41] <jinxer-wm>	 RESOLVED: CirrusSearchUpdaterKafkaMessagesInTooLow: ...
[06:10:45] <jinxer-wm>	 The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - ...
[06:10:51] <jinxer-wm>	 https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqcodfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesI
[06:10:58] <jinxer-wm>	 RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[06:11:12] <jinxer-wm>	 FIRING: [9x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:11:19] <jinxer-wm>	 RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:11:23] <jinxer-wm>	 RESOLVED: [14x] ProbeDown: Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:11:45] <jinxer-wm>	 RESOLVED: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[06:13:00] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:09] <jinxer-wm>	 FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[06:13:14] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:13:33] <jinxer-wm>	 RESOLVED: [102x] KubernetesCalicoDown: kubernetes2021.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:13:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:13:51] <jinxer-wm>	 RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:14:23] <_joe_>	 !incidents
[06:14:24] <sirenbot>	 5300 (ACKED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[06:14:24] <sirenbot>	 5302 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[06:14:24] <sirenbot>	 5304 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[06:14:24] <sirenbot>	 5306 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[06:14:25] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:14:25] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[06:14:25] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[06:14:26] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[06:14:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10213090 (10phaultfinder)
[06:14:56] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[06:17:00] <jinxer-wm>	 RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[06:25:51] <jinxer-wm>	 RESOLVED: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:36:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on ms-be1065 - https://phabricator.wikimedia.org/T376775 (ops-monitoring-bot) NEW
[06:36:56] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:51:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:52:00] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:54:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:10] <wikibugs>	 (CR) Slyngshede: [C:+2] ldapbackend: Remove post_save signal for user models. [software/bitu] - https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) (owner: Slyngshede)
[07:01:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:02:41] <wikibugs>	 (Merged) jenkins-bot: ldapbackend: Remove post_save signal for user models. [software/bitu] - https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) (owner: Slyngshede)
[07:06:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:06:56] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:07:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:13:40] <moritzm>	 !log remove ganeti2010 from active nodes T376594
[07:13:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:43] <stashbot>	 T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594
[07:16:15] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti2010:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:19:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:20:34] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[07:20:56] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[07:22:27] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[07:22:51] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[07:26:28] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:26:50] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:27:00] <wikibugs>	 (PS1) Giuseppe Lavagetto: [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - https://gerrit.wikimedia.org/r/1078796
[07:27:54] <wikibugs>	 (CR) Jon Harald Søby: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: Pppery)
[07:28:54] <wikibugs>	 (CR) CI reject: [V:-1] [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - https://gerrit.wikimedia.org/r/1078796 (owner: Giuseppe Lavagetto)
[07:36:58] <wikibugs>	 (PS1) Slyngshede: P:idm add passlib dependency. [puppet] - https://gerrit.wikimedia.org/r/1078875
[07:38:40] <wikibugs>	 (CR) Slyngshede: [C:+2] P:idm add passlib dependency. [puppet] - https://gerrit.wikimedia.org/r/1078875 (owner: Slyngshede)
[07:43:24] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:43:45] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:45:09] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:46:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:47:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1011.eqiad.wmnet
[07:48:31] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:49:23] <wikibugs>	 (CR) Elukey: "test-cookbooked, works fine.. Of course I found other corner cases of BIOS settings for newer Supermicro models, sigh, but unrelated to th" [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[07:51:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:51:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service ganeti2010:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:52:08] <wikibugs>	 (PS1) Muehlenhoff: Switch cloudcephosd1011 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078891 (https://phabricator.wikimedia.org/T349619)
[07:53:21] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1011 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078891 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:55:50] <wikibugs>	 SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10213162 (elukey) Found a new interesting issue when running the provision cookbook for mc-misc2001:  ` "Message":...
[07:57:25] <wikibugs>	 (CR) Elukey: "Folks I am battling with Supermicro and BIOS/UEFI in https://phabricator.wikimedia.org/T365372#10213162, there are some weird things that " [cookbooks] - https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: Ayounsi)
[07:58:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376764#10213168 (phaultfinder)
[08:00:05] <jouncebot>	 andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T0800)
[08:00:23] <andre>	 o/
[08:00:52] <wikibugs>	 (PS2) Brouberol: airflow: expose non-sensitive configuration in the web UI [deployment-charts] - https://gerrit.wikimedia.org/r/1078689
[08:02:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1011.eqiad.wmnet
[08:02:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1021.eqiad.wmnet
[08:03:52] <andre>	 I will now start promoting group1 wikis to 1.43.0-wmf.26
[08:04:11] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.43.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1078893 (https://phabricator.wikimedia.org/T375657)
[08:04:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.43.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1078893 (https://phabricator.wikimedia.org/T375657) (owner: TrainBranchBot)
[08:04:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:05:02] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.43.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1078893 (https://phabricator.wikimedia.org/T375657) (owner: TrainBranchBot)
[08:07:21] <wikibugs>	 (PS1) Muehlenhoff: Switch cloudcephosd1021 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078894 (https://phabricator.wikimedia.org/T349619)
[08:08:14] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1021 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078894 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:10:52] <wikibugs>	 (PS1) Slyngshede: P:trafficserver::backend remove CloudIDM. [puppet] - https://gerrit.wikimedia.org/r/1078895
[08:11:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1021.eqiad.wmnet
[08:12:00] <logmsgbot>	 !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.26  refs T375657
[08:12:03] <stashbot>	 T375657: 1.43.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T375657
[08:12:50] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777 (JTweed-WMF) NEW
[08:13:30] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4253/co" [puppet] - https://gerrit.wikimedia.org/r/1078895 (owner: Slyngshede)
[08:16:02] <wikibugs>	 (PS1) Brouberol: data-platform: alert if datahub/superset pods are down for at least 5 minutes [alerts] - https://gerrit.wikimedia.org/r/1078897
[08:16:29] <wikibugs>	 (CR) JMeybohm: "I don't think this is required. NodePort traffic will bypass the INPUT chain and is not captured by ferm." [puppet] - https://gerrit.wikimedia.org/r/1078719 (https://phabricator.wikimedia.org/T344171) (owner: CDanis)
[08:18:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephmon1005.eqiad.wmnet
[08:18:39] <wikibugs>	 (PS2) Slyngshede: P:trafficserver::backend remove CloudIDM. [puppet] - https://gerrit.wikimedia.org/r/1078895
[08:19:31] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4254/co" [puppet] - https://gerrit.wikimedia.org/r/1078895 (owner: Slyngshede)
[08:19:57] <wikibugs>	 (PS1) Muehlenhoff: Switch cloudcephmon1005 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078898 (https://phabricator.wikimedia.org/T349619)
[08:21:14] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch cloudcephmon1005 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078898 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:22:14] <wikibugs>	 (CR) Volans: [stub] mwscript-k8s: Add concurrency limiting via poolcounter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1078796 (owner: Giuseppe Lavagetto)
[08:22:46] <wikibugs>	 (CR) Btullis: [C:+1] Provision dummy secrets for ceph-csi users [labs/private] - https://gerrit.wikimedia.org/r/1078401 (https://phabricator.wikimedia.org/T376407) (owner: Brouberol)
[08:23:36] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudcephmon1005.eqiad.wmnet
[08:23:42] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/1078402 (https://phabricator.wikimedia.org/T376407) (owner: Brouberol)
[08:24:18] <wikibugs>	 (CR) Brouberol: [C:+2] Provision dummy secrets for ceph-csi users [labs/private] - https://gerrit.wikimedia.org/r/1078401 (https://phabricator.wikimedia.org/T376407) (owner: Brouberol)
[08:24:20] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] Provision dummy secrets for ceph-csi users [labs/private] - https://gerrit.wikimedia.org/r/1078401 (https://phabricator.wikimedia.org/T376407) (owner: Brouberol)
[08:24:28] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] ceph: provision the dse-k8s-csi-cephfs user capabilities [puppet] - https://gerrit.wikimedia.org/r/1078402 (https://phabricator.wikimedia.org/T376407) (owner: Brouberol)
[08:29:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:30:46] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[08:30:56] <wikibugs>	 (CR) Ayounsi: ripeatlas: clean up resource defs after deletion (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: Tiziano Fogli)
[08:33:13] <wikibugs>	 (PS2) Elukey: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372)
[08:35:53] <wikibugs>	 (CR) Brouberol: [C:+2] data-platform: alert if datahub/superset pods are down for at least 5 minutes [alerts] - https://gerrit.wikimedia.org/r/1078897 (owner: Brouberol)
[08:36:14] <wikibugs>	 (CR) Stevemunene: [C:+2] Change an-worker117[67] to use reuse partman recipe. [puppet] - https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[08:36:44] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:37:06] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:37:53] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:38:14] <wikibugs>	 (PS3) Brouberol: airflow: align default configuration with our pupetized instances [deployment-charts] - https://gerrit.wikimedia.org/r/1078689
[08:41:18] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:43:21] <wikibugs>	 (CR) Stevemunene: [C:+1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/1078689 (owner: Brouberol)
[08:43:52] <wikibugs>	 (CR) Volans: [C:+1] "The approach LGTM, to be tested if the host behaves as we espect." [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[08:44:27] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: align default configuration with our pupetized instances [deployment-charts] - https://gerrit.wikimedia.org/r/1078689 (owner: Brouberol)
[08:46:09] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:46:30] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:47:44] <wikibugs>	 (PS8) Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719)
[08:48:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: prometheus: add kernel-panic detector (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[08:48:46] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye
[08:49:10] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:49:22] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): scap: Also exclude (my)sql from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553)
[08:49:50] <wikibugs>	 (CR) Btullis: [C:+1] airflow: align default configuration with our pupetized instances [deployment-charts] - https://gerrit.wikimedia.org/r/1078689 (owner: Brouberol)
[08:49:53] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:50:50] <wikibugs>	 (PS3) Elukey: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372)
[08:51:13] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[08:51:14] <wikibugs>	 (CR) CI reject: [V:-1] scap: Also exclude (my)sql from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) (owner: Lucas Werkmeister (WMDE))
[08:52:00] <wikibugs>	 (CR) Vgutierrez: [C:+1] "[nitpick] it looks like you missed the Bug footer on the commit message" [puppet] - https://gerrit.wikimedia.org/r/1078895 (owner: Slyngshede)
[08:52:08] <wikibugs>	 (CR) Elukey: "I have isolated the code to reboot for readability, and also I made sure that the reboot happens after the BMC network settings are applie" [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[08:52:51] <wikibugs>	 (PS4) Elukey: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372)
[08:53:18] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): scap: Also exclude (my)sql from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553)
[08:53:18] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): scap: Also exclude (my)sql from mwscript deprecation warning (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) (owner: Lucas Werkmeister (WMDE))
[08:53:38] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:54:01] <wikibugs>	 (PS3) Slyngshede: P:trafficserver::backend remove CloudIDM. [puppet] - https://gerrit.wikimedia.org/r/1078895 (https://phabricator.wikimedia.org/T360795)
[08:54:02] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:55:17] <wikibugs>	 (CR) Slyngshede: [C:+2] P:trafficserver::backend remove CloudIDM. [puppet] - https://gerrit.wikimedia.org/r/1078895 (https://phabricator.wikimedia.org/T360795) (owner: Slyngshede)
[08:56:17] <wikibugs>	 (PS1) Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726)
[08:57:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:58:01] <wikibugs>	 (CR) Kosta Harlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: Kosta Harlan)
[09:04:31] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:04:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:14:32] <wikibugs>	 (CR) Elukey: [C:+2] "Tested and it seems working nicely!" [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:18:45] <wikibugs>	 (CR) Ladsgroup: dumps: Drop the globalblocks table dump (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: Kosta Harlan)
[09:20:42] <wikibugs>	 (PS1) Mhorsey: Release CampaignEvents to eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1078907 (https://phabricator.wikimedia.org/T376786)
[09:20:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:27:12] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:32:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove ganeti2009/ganeti2010 from Ganeti role [puppet] - https://gerrit.wikimedia.org/r/1078660 (https://phabricator.wikimedia.org/T376594) (owner: Muehlenhoff)
[09:37:42] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10213457 (MoritzMuehlenhoff)
[09:40:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1026.eqiad.wmnet
[09:41:03] <wikibugs>	 (PS1) Muehlenhoff: Switch cloudcephosd1026 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078909 (https://phabricator.wikimedia.org/T349619)
[09:42:27] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10213469 (MoritzMuehlenhoff) Rollout of further nodes is put on hold until we get Redfish licences for the new servers.
[09:42:50] <Dreamy_Jazz>	 !log Started time-limited MediaModeration scan on enwiki for 16 hrs to catch up with monthly request limit - https://wikitech.wikimedia.org/wiki/MediaModeration
[09:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1000)
[10:05:26] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[10:05:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:07:59] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:09:48] <wikibugs>	 (PS9) Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719)
[10:09:57] <wikibugs>	 (PS10) Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719)
[10:10:14] <wikibugs>	 (PS11) Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719)
[10:10:43] <wikibugs>	 (CR) Arturo Borrero Gonzalez: prometheus: add kernel-panic detector (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[10:10:44] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[10:11:09] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye
[10:12:10] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1026 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1078909 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:12:55] <wikibugs>	 (PS1) Kosta Harlan: dumps: stop running the dump_global_blocks job [puppet] - https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726)
[10:15:23] <wikibugs>	 (PS2) Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726)
[10:15:39] <wikibugs>	 (PS2) Kosta Harlan: dumps: Stop running the dump_global_blocks job [puppet] - https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726)
[10:15:46] <wikibugs>	 (PS3) Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726)
[10:16:20] <wikibugs>	 (CR) Kosta Harlan: dumps: Drop the globalblocks table dump (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: Kosta Harlan)
[10:16:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1026.eqiad.wmnet
[10:17:16] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] prometheus: add kernel-panic detector [puppet] - https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[10:19:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10213566 (phaultfinder)
[10:19:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:21:48] <wikibugs>	 (CR) Elukey: [C:+2] swift: avoid rate-limit for the Docker account [puppet] - https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: Elukey)
[10:26:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69506 and previous config saved to /var/cache/conftool/dbconfig/20241009-102636-ladsgroup.json
[10:27:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1027.eqiad.wmnet
[10:28:19] <elukey>	 !log roll restart swift-proxy on ms-fe* to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078380
[10:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cloudcephosd1027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078916 (https://phabricator.wikimedia.org/T349619)
[10:30:43] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10213583 (10elukey) Hi folks! Not sure what is special about the server, but in this case the Supermicro Network settings for the BMC can't be applied...
[10:32:14] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10213586 (10jcrespo) @elukey thanks, that is all I needed, some info on where we were and I wasn't aware of ongoing progress during my sabbatical, whic...
[10:34:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790 (10MoritzMuehlenhoff) 03NEW
[10:34:32] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#10213600 (10MoritzMuehlenhoff) p:05Triage→03High
[10:35:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[10:36:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1027 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078916 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:37:02] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis)
[10:38:01] <moritzm>	 btullis: you can merge my patch along with yours
[10:41:02] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1077710 (owner: 10Klausman)
[10:41:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69507 and previous config saved to /var/cache/conftool/dbconfig/20241009-104142-ladsgroup.json
[10:44:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[10:44:39] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye
[10:46:02] <wikibugs>	 (03PS2) 10Btullis: Switch cephosd1001 to use the nftables based firewall [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259)
[10:48:00] <wikibugs>	 (03CR) 10Muehlenhoff: "You still need to define the setting for Envoy" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[10:48:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:49:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1027.eqiad.wmnet
[10:50:50] <elukey>	 this is me --^
[10:51:41] <wikibugs>	 (03PS1) 10Elukey: services: update proxied port for kask [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078918 (https://phabricator.wikimedia.org/T363996)
[10:52:00] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:56:00] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Testing this since it is staging and already broken :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078918 (https://phabricator.wikimedia.org/T363996) (owner: 10Elukey)
[10:56:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69511 and previous config saved to /var/cache/conftool/dbconfig/20241009-105647-ladsgroup.json
[11:00:05] <jouncebot>	 mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1100).
[11:00:37] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[11:00:39] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[11:03:38] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: python_deploy::venv: transform into a define [puppet] - 10https://gerrit.wikimedia.org/r/1078707
[11:03:39] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782)
[11:03:39] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782)
[11:04:17] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[11:04:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto)
[11:07:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:08:09] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:11:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69513 and previous config saved to /var/cache/conftool/dbconfig/20241009-111154-ladsgroup.json
[11:12:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10213709 (10elukey) Rolled out the swift proxy change, it seems that it has solved the issue. Doing more tests before closing to...
[11:14:12] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782)
[11:14:12] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782)
[11:15:36] <wikibugs>	 (03CR) 10Ladsgroup: "This absents the timer but not the directories. That's fine if you want to keep old dumps. Whatever you decide." [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[11:24:24] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[11:27:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4255/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078707 (owner: 10Giuseppe Lavagetto)
[11:29:35] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, surely it makes sense to convert it to a define so that we can use more than one per host. Thanks for the patch." [puppet] - 10https://gerrit.wikimedia.org/r/1078707 (owner: 10Giuseppe Lavagetto)
[11:30:13] <wikibugs>	 (03CR) 10Kosta Harlan: "I think that is probably OK? How soon after merging this one can we merge I428c67b7ed7b481bfbe084b0a6e3f1025f9e6d9d ?" [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[11:37:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:38:21] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719)
[11:40:02] <wikibugs>	 (03PS1) 10Gmodena: dse-k8s-services: content_history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787)
[11:47:40] <wikibugs>	 (03PS1) 10Brouberol: cephosd: fix syntax of the dse-k8s-csi-cephfs caps [puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407)
[11:48:25] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Great, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol)
[11:48:34] <wikibugs>	 (03PS2) 10Brouberol: cephosd: fix syntax of the dse-k8s-csi-cephfs caps [puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407)
[11:49:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] cephosd: fix syntax of the dse-k8s-csi-cephfs caps [puppet] - 10https://gerrit.wikimedia.org/r/1078926 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol)
[11:51:29] <wikibugs>	 (03CR) 10Jelto: [C:03+2] wikidata-query-gui: remove experimental endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078664 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[11:51:59] <logmsgbot>	 !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@b2c30ad]: T375153
[11:52:01] <jynus>	 !log start systemctl start wmf_auto_restart_routinator.service on rpki2003
[11:52:02] <stashbot>	 T375153: ETL pipeline for Automoderator daily monitoring metrics - https://phabricator.wikimedia.org/T375153
[11:52:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:30] <logmsgbot>	 !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@b2c30ad]: T375153 (duration: 02m 32s)
[11:54:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:55:23] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: remove experimental endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078664 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[12:02:58] <wikibugs>	 (03PS2) 10JMeybohm: [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/1078796 (https://phabricator.wikimedia.org/T376795) (owner: 10Giuseppe Lavagetto)
[12:04:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [stub] mwscript-k8s: Add concurrency limiting via poolcounter [puppet] - 10https://gerrit.wikimedia.org/r/1078796 (https://phabricator.wikimedia.org/T376795) (owner: 10Giuseppe Lavagetto)
[12:04:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:07:25] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis)
[12:11:16] <wikibugs>	 (03CR) 10Awight: [C:03+1] [config] Rename moved gadget name setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078463 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch)
[12:11:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:15:01] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:15:15] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:15:46] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:16:04] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:18:59] <moritzm>	 !log installing initramfs-tools bugfix updates from Bookworm point release
[12:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:07] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] "Just to say I approve of this change.  In our attempts to find users of various dumps, we haven't heard anyone speak up for this one." [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[12:20:13] <wikibugs>	 (03PS1) 10Gmodena: dse-k8s-services: content_history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787)
[12:20:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:21:39] <wikibugs>	 (03PS1) 10JMeybohm: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795)
[12:21:41] <wikibugs>	 (03PS1) 10JMeybohm: Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795)
[12:23:26] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:23:37] <wikibugs>	 (03PS2) 10Gmodena: dse-k8s-services: content_history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787)
[12:23:43] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:24:02] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:24:14] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:24:36] <wikibugs>	 (03CR) 10Gmodena: dse-k8s-services: content_history: version bump image. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena)
[12:26:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[12:27:24] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[12:33:16] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts rpki2002.codfw.wmnet
[12:33:57] <wikibugs>	 (03PS1) 10Kosta Harlan: QuickSurveys: Deploy Safety Survey with zero coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517)
[12:34:33] <wikibugs>	 (03CR) 10Tiziano Fogli: ripeatlas: clean up resource defs after deletion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli)
[12:35:04] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan)
[12:36:09] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C:03+1] "Noting that this will configure the extension to use the shared database in x1, so:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio)
[12:38:07] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[12:38:12] <wikibugs>	 (03PS2) 10JMeybohm: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795)
[12:38:12] <wikibugs>	 (03PS2) 10JMeybohm: Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795)
[12:38:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078463 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch)
[12:38:59] <wikibugs>	 (03Abandoned) 10CDanis: ferm: allow DNS traffic against k8s control planes [puppet] - 10https://gerrit.wikimedia.org/r/1078719 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis)
[12:41:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:41:16] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rpki2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
[12:41:45] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rpki2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
[12:41:45] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:41:45] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rpki2002.codfw.wmnet
[12:43:43] <wikibugs>	 (03PS3) 10Tiziano Fogli: ripeatlas: clean up resource defs after deletion [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506)
[12:45:10] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] ripeatlas: clean up resource defs after deletion [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli)
[12:45:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10213939 (10Aklapper) (Per T376798 I removed an image from this task.)
[12:47:28] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:51:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[12:52:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[12:53:05] <wikibugs>	 (03CR) 10Klausman: [C:03+2] hiera/modules: Add ML Lab machine roles and config [puppet] - 10https://gerrit.wikimedia.org/r/1077710 (owner: 10Klausman)
[12:54:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10213960 (10ayounsi) Phase 2 lgtm, one point though : you need to trunk the management vlan between the old and new switch for fasw to be reachable between step...
[12:57:17] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis)
[12:58:57] <wikibugs>	 (03CR) 10Albertoleoncio: "Yep, that's right." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio)
[12:59:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1300).
[13:00:05] <jouncebot>	 Ammar, albertoleoncio, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10213969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[13:00:09] <kostajh>	 hi
[13:00:13] <albertoleoncio>	 hi
[13:00:31] <Lucas_WMDE>	 o/
[13:00:32] <kostajh>	 do you all mind if I go first, as I need to be away from the keyboard in ~25 minutes? 
[13:00:42] <Lucas_WMDE>	 sure, go ahead imho
[13:00:46] <albertoleoncio>	 sure
[13:00:49] <kostajh>	 thx
[13:01:09] <kostajh>	 starting then
[13:01:11] <Lucas_WMDE>	 (I’ll also be in a meeting in half an hour from now btw, so let’s see if we get through all the changes in time)
[13:01:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan)
[13:02:08] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C:03+1] "Great, thanks for confirming ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio)
[13:02:14] <kostajh>	 ack, will move it along as quickly as I can
[13:02:35] <wikibugs>	 (03Merged) 10jenkins-bot: QuickSurveys: Deploy Safety Survey with zero coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078929 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan)
[13:03:38] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1078929|QuickSurveys: Deploy Safety Survey with zero coverage (T376517)]]
[13:03:40] <stashbot>	 T376517: First test, then launch the new Safety Survey  - https://phabricator.wikimedia.org/T376517
[13:05:22] <wikibugs>	 (03PS1) 10Klausman: aptrepo: Add two missing packages to rocm61 repo [puppet] - 10https://gerrit.wikimedia.org/r/1078937 (https://phabricator.wikimedia.org/T376380)
[13:05:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:06:07] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1078929|QuickSurveys: Deploy Safety Survey with zero coverage (T376517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:07:57] <kostajh>	 which option should I use with the WikimediaDebug browser extension? k8s-mwdebug?
[13:08:19] <cdanis>	 kostajh: yep :)
[13:08:26] <kostajh>	 thx
[13:08:39] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "I think the timer jobs are present on the replica machines, see https://puppet-compiler.wmflabs.org/output/1078752/4250/gerrit2003.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn)
[13:09:01] <Lucas_WMDE>	 the other ones should also work at the moment (`scap backport` deploys to all of them) but k8s-mwdebug is the one with a future ;)
[13:09:36] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[13:09:40] <kostajh>	 lgtm
[13:09:47] <wikibugs>	 (03CR) 10FNegri: team-wmcs: add kernel panic alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez)
[13:11:08] <Lucas_WMDE>	 Ammar: just checking, are you around? (once kostajh is done deploying)
[13:11:46] <Ammar>	 Lucas_WMDE yes
[13:11:56] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:02] <Lucas_WMDE>	 ok :)
[13:12:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host etherpad2002.codfw.wmnet
[13:12:13] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] aptrepo: Add two missing packages to rocm61 repo [puppet] - 10https://gerrit.wikimedia.org/r/1078937 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman)
[13:13:51] <wikibugs>	 (03CR) 10Klausman: [C:03+2] aptrepo: Add two missing packages to rocm61 repo [puppet] - 10https://gerrit.wikimedia.org/r/1078937 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman)
[13:14:15] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078929|QuickSurveys: Deploy Safety Survey with zero coverage (T376517)]] (duration: 10m 37s)
[13:14:18] <stashbot>	 T376517: First test, then launch the new Safety Survey  - https://phabricator.wikimedia.org/T376517
[13:14:23] <kostajh>	 ok, over to you Lucas_WMDE 
[13:14:24] <kostajh>	 thanks!
[13:14:28] <Lucas_WMDE>	 ok!
[13:14:30] <Lucas_WMDE>	 thank you!
[13:14:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078680 (https://phabricator.wikimedia.org/T376536) (owner: 10Ammarpad)
[13:14:52] <wikibugs>	 (03PS2) 10Slyngshede: Speed holes. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675
[13:15:15] <inflatador>	 Hello 0lly! We're getting some puppet errors on `datahubsearch` (opensearch errors). Based on T362429 , it seems this could be related to your team releasing a new curator pkg...can anyone take a look?  
[13:15:15] <stashbot>	 T362429: Investigate Puppet failures on datahubsearch hosts - https://phabricator.wikimedia.org/T362429
[13:15:25] <wikibugs>	 (03Merged) 10jenkins-bot: sdwiki: Add new logo and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078680 (https://phabricator.wikimedia.org/T376536) (owner: 10Ammarpad)
[13:15:32] <kostajh>	 hmm
[13:15:36] <inflatador>	 oops, wrong room
[13:15:38] <kostajh>	 so my change worked with mw debug
[13:15:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1078680|sdwiki: Add new logo and tagline (T376536)]]
[13:15:55] <stashbot>	 T376536: Request for change the sd.wikipedia logo - https://phabricator.wikimedia.org/T376536
[13:15:56] <kostajh>	 but now I get `Error: Module "ext.quicksurveys.lib" is not loaded` when I try to load the survey :/
[13:16:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad2002.codfw.wmnet
[13:16:39] <wikibugs>	 (03PS3) 10Slyngshede: Permissions: Cleanup code and reduce LDAP queries. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675
[13:16:56] <kostajh>	 ah, now it works 🤷
[13:17:40] <Lucas_WMDE>	 huh, ok
[13:17:41] <wikibugs>	 (03PS4) 10Slyngshede: Permissions: Cleanup code and reduce LDAP queries. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675
[13:17:55] <Lucas_WMDE>	 I guess you need to be on a purged page or something?
[13:17:56] <Lucas_WMDE>	 not sure
[13:18:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host etherpad1004.eqiad.wmnet
[13:18:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ammarpad: Backport for [[gerrit:1078680|sdwiki: Add new logo and tagline (T376536)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:18:30] <Lucas_WMDE>	 Ammar: please test!
[13:18:49] <Ammar>	 OK
[13:19:00] <Lucas_WMDE>	 I definitely see a difference, but I can’t say if it’s right or not ^^
[13:20:22] <wikibugs>	 (03CR) 10Btullis: "I'll have a go at this." [puppet] - 10https://gerrit.wikimedia.org/r/1076910 (owner: 10Hashar)
[13:22:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1004.eqiad.wmnet
[13:22:38] <Ammar>	 I tried with purging but really can't get the new logo
[13:22:49] <Lucas_WMDE>	 did you try ctrl+f5?
[13:22:55] <Lucas_WMDE>	 that’s what I had to do
[13:23:08] <Lucas_WMDE>	 (and with WikimediaDebug enabled, of course)
[13:23:08] <Ammar>	 This is the new logo https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2-sd.svg
[13:23:45] <Lucas_WMDE>	 yes, that matches what I see at https://sd.wikipedia.org/wiki/%D9%85%D9%8F%DA%A9_%D8%B5%D9%81%D8%AD%D9%88?useskin=vector
[13:23:50] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org
[13:25:14] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[13:25:17] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[13:25:31] <wikibugs>	 (03CR) 10Hashar: "Done by I393e27133e0ad7bb414491e76fa959368c14be86" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar)
[13:26:05] <wikibugs>	 (03PS1) 10Btullis: Allow underscores, hyphens, and dots in hdfs_file names [puppet] - 10https://gerrit.wikimedia.org/r/1078940 (https://phabricator.wikimedia.org/T323692)
[13:26:31] <Lucas_WMDE>	 Ammar: did you try force-reloading the page? (https://en.wikipedia.org/wiki/Help:Purge#Purge_local_browser_cache has some more keyboard shortcuts)
[13:26:43] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4256/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar)
[13:27:08] <Ammar>	 @Lucas_WMDE Ok yes it works now. (yes I am using WikimediaDebug)
[13:27:12] <Lucas_WMDE>	 ok!
[13:27:13] <hashar>	 :]
[13:27:15] <Lucas_WMDE>	 and does it look correct?
[13:27:41] <Lucas_WMDE>	 (I’m not sure if that’s implied by “it works now” so I just want to make sure ^^)
[13:27:44] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org
[13:28:13] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org
[13:28:22] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Already fixed by Riccardo!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar)
[13:29:24] <wikibugs>	 (03Merged) 10jenkins-bot: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078927 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[13:29:25] <wikibugs>	 (03Merged) 10jenkins-bot: Align calico resource settings for codfw and eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078928 (https://phabricator.wikimedia.org/T376795) (owner: 10JMeybohm)
[13:30:35] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org
[13:30:45] <Lucas_WMDE>	 I guess I’ll continue with the deployment…
[13:30:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ammarpad: Continuing with sync
[13:30:54] <Ammar>	 Lucas_WMDE sorry. I am seeing the new logo correctly. It works
[13:30:58] <Lucas_WMDE>	 ok, great!
[13:31:00] <Ammar>	 You can proceed
[13:31:03] <Lucas_WMDE>	 thanks!
[13:31:05] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org
[13:31:49] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[13:31:50] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, let me know when this should be deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar)
[13:32:09] <wikibugs>	 (03PS1) 10Klausman: aptrepo: Add more missing packages to the rocm61 import [puppet] - 10https://gerrit.wikimedia.org/r/1078941 (https://phabricator.wikimedia.org/T376380)
[13:32:09] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:32:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gerrit2003.wikimedia.org
[13:32:35] <wikibugs>	 (03Abandoned) 10Hashar: Add some HIDPI Wikivoyage logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529464 (https://phabricator.wikimedia.org/T230114) (owner: 10Jc86035)
[13:32:37] <wikibugs>	 (03CR) 10Klausman: [C:03+2] aptrepo: Add more missing packages to the rocm61 import [puppet] - 10https://gerrit.wikimedia.org/r/1078941 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman)
[13:33:10] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Platform, 10Dumps-Generation, and 4 others: Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10214033 (10xcollazo) As per [[ https://wikimedia.slack.com/archives/CTFK3B423/p1728413707760419 | slack discussion ]], noting...
[13:33:28] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org
[13:34:18] <Ammar>	 Lucas_WMDE Thank you
[13:34:39] <Lucas_WMDE>	 np :)
[13:35:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078680|sdwiki: Add new logo and tagline (T376536)]] (duration: 19m 34s)
[13:35:29] <stashbot>	 T376536: Request for change the sd.wikipedia logo - https://phabricator.wikimedia.org/T376536
[13:35:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10214086 (10MoritzMuehlenhoff)
[13:36:06] <Lucas_WMDE>	 is anyone else around who can deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1078774 for albertoleoncio?
[13:36:13] <Lucas_WMDE>	 I’m in a meeting now so I’d prefer not to deploy in parallel
[13:36:16] <albertoleoncio>	 please... =D
[13:37:05] <wikibugs>	 (03Abandoned) 10Hashar: Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor)
[13:37:17] <Lucas_WMDE>	 oh, damn, and I need to purge for Ammar 
[13:37:17] <Lucas_WMDE>	 one sec
[13:39:05] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@deploy2002 $ printf 'https://en.wikipedia.org/static/images/%s\n' 'project-logos/sdwiki.png' 'project-logos/sdwiki-1.5x.png' 'project-logos/sdwiki-2x.png' 'mobile/copyright/wikipedia-wordmark-sd.svg' 'mobile/copyright/wikipedia-tagline-sd.svg' | mwscript-k8s --attach -- purgeList.php # T376536
[13:39:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:22] <Lucas_WMDE>	 took a bit longer than usual but done
[13:39:29] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum
[13:39:34] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] jenkins: dedupe apt::repository for thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar)
[13:39:43] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test1004.wikimedia.org
[13:39:48] <wikibugs>	 (03Merged) 10jenkins-bot: tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar)
[13:40:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix /etc/issue for config-master [puppet] - 10https://gerrit.wikimedia.org/r/1078942
[13:40:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio)
[13:40:30] <Lucas_WMDE>	 alright, I’ll do the deploy for albertoleoncio 
[13:40:30] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix /etc/issue for config-master [puppet] - 10https://gerrit.wikimedia.org/r/1078942
[13:40:37] <Lucas_WMDE>	 might just be a bit slower than usual ^^
[13:40:44] <Lucas_WMDE>	 but should be doable in the remaining 20 minutes
[13:40:45] <Lucas_WMDE>	 jouncebot: next
[13:40:46] <jouncebot>	 In 0 hour(s) and 19 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1400)
[13:41:08] <wikibugs>	 (03Merged) 10jenkins-bot: [brwikimedia] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078774 (https://phabricator.wikimedia.org/T376747) (owner: 10Albertoleoncio)
[13:41:24] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons.
[13:41:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1078774|[brwikimedia] Enable the CampaignEvents extension (T376747)]]
[13:41:36] <stashbot>	 T376747: Enable CampaignEvents Extension on br.wikimedia - https://phabricator.wikimedia.org/T376747
[13:41:59] <Lucas_WMDE>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1077417 would also be nice to deploy but probably won’t happen this window
[13:42:06] <wikibugs>	 (03CR) 10Xcollazo: "Looks like the following also needs to be removed:" [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[13:42:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1028.eqiad.wmnet
[13:42:43] <logmsgbot>	 !log brouberol@cumin1002 END (ERROR) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=97) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons.
[13:43:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T367856)', diff saved to https://phabricator.wikimedia.org/P69516 and previous config saved to /var/cache/conftool/dbconfig/20241009-134305-ladsgroup.json
[13:43:07] <wikibugs>	 (03PS1) 10Elukey: kask: use if instead of with in _config.yaml to skip tls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766)
[13:43:09] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:43:11] <wikibugs>	 (03CR) 10Bking: [C:03+1] Allow underscores, hyphens, and dots in hdfs_file names [puppet] - 10https://gerrit.wikimedia.org/r/1078940 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis)
[13:43:40] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Allow underscores, hyphens, and dots in hdfs_file names [puppet] - 10https://gerrit.wikimedia.org/r/1078940 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis)
[13:43:42] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test1004.wikimedia.org
[13:43:45] <wikibugs>	 (03PS1) 10Klausman: aptrepo: Add more missing packages to the rocm61 import, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1078943 (https://phabricator.wikimedia.org/T37638)
[13:43:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 albertoleoncio, lucaswerkmeister-wmde: Backport for [[gerrit:1078774|[brwikimedia] Enable the CampaignEvents extension (T376747)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:44:03] <wikibugs>	 (03CR) 10Klausman: [C:03+2] aptrepo: Add more missing packages to the rocm61 import, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1078943 (https://phabricator.wikimedia.org/T37638) (owner: 10Klausman)
[13:44:03] <albertoleoncio>	 Looks good here
[13:44:03] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp1004.wikimedia.org
[13:44:03] <Lucas_WMDE>	 albertoleoncio: can you test using WikimediaDebug?
[13:44:06] <Lucas_WMDE>	 ok!
[13:44:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 albertoleoncio, lucaswerkmeister-wmde: Continuing with sync
[13:44:35] <wikibugs>	 (03PS2) 10Elukey: kask: use if instead of with in _config.yaml to skip tls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766)
[13:44:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cloudcephosd1028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078945 (https://phabricator.wikimedia.org/T349619)
[13:44:54] <Lucas_WMDE>	 yup, https://br.wikimedia.org/wiki/Especial:AllEvents definitely shows a nonzero amount of events
[13:44:57] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet
[13:45:00] <logmsgbot>	 !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host flink-zk1001.eqiad.wmnet
[13:45:21] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet
[13:45:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1028 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078945 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:46:00] <moritzm>	 klausman: I'll merge your patch along
[13:46:04] <klausman>	 ty!
[13:47:05] <moritzm>	 merged
[13:48:00] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp1004.wikimedia.org
[13:48:26] <wikibugs>	 (03PS1) 10Slyngshede: IDP: Failover IDP service to eqiad. [dns] - 10https://gerrit.wikimedia.org/r/1078946
[13:48:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078774|[brwikimedia] Enable the CampaignEvents extension (T376747)]] (duration: 07m 04s)
[13:48:40] <stashbot>	 T376747: Enable CampaignEvents Extension on br.wikimedia - https://phabricator.wikimedia.org/T376747
[13:48:43] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2004.wikimedia.org
[13:48:55] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:06] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1001.eqiad.wmnet
[13:49:16] <hnowlan>	 jouncebot: nowandnext
[13:49:16] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1300)
[13:49:16] <jouncebot>	 In 0 hour(s) and 10 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1400)
[13:49:31] <albertoleoncio>	 Lucas_WMDE: Thanks! =D
[13:49:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1028.eqiad.wmnet
[13:49:39] <Lucas_WMDE>	 np ^^
[13:50:11] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup[1010-1011].eqiad.wmnet with reason: T376800
[13:50:25] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup[1010-1011].eqiad.wmnet with reason: T376800
[13:50:44] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1003.eqiad.wmnet
[13:50:57] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "Nope this is not the correct approach, it should already be working." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: 10Elukey)
[13:51:11] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2004.wikimedia.org
[13:51:24] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2005.wikimedia.org
[13:51:57] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:52:12] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet
[13:52:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[13:52:56] <jinxer-wm>	 FIRING: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:52:59] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox
[13:52:59] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org
[13:53:00] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:53:19] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:54:21] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719)
[13:54:30] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1003.eqiad.wmnet
[13:55:19] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2005.wikimedia.org
[13:56:07] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet
[13:57:21] <wikibugs>	 (03PS3) 10Elukey: services: skip the kask's tls config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766)
[13:57:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Add members of platform-engineering to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1078948 (https://phabricator.wikimedia.org/T376808)
[13:57:44] <jinxer-wm>	 RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment cert-manager in cert-manager at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:57:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[13:57:47] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host flink-zk1002.eqiad.wmnet
[13:58:02] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:58:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Phase out platform-engineering POSIX group - https://phabricator.wikimedia.org/T376808 (10MoritzMuehlenhoff) 03NEW
[13:58:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P69517 and previous config saved to /var/cache/conftool/dbconfig/20241009-135812-ladsgroup.json
[13:58:33] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] IDP: Failover IDP service to eqiad. [dns] - 10https://gerrit.wikimedia.org/r/1078946 (owner: 10Slyngshede)
[13:58:45] <wikibugs>	 (03CR) 10Ladsgroup: "On paper, half an hour, but we usually wait longer (for days) to make sure all hosts that were shut down or had puppet disabled get the upd" [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1400)
[14:00:25] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] api-gateway: add REST gateway Lua CSP handler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[14:00:28] <wikibugs>	 (03CR) 10Kosta Harlan: "this patch is all we need, yes." [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[14:01:04] * James_F waves.
[14:01:31] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1002.eqiad.wmnet
[14:01:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:02:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet
[14:02:38] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:03:02] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-09-24-145528 to 2024-10-08-175830 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078949 (https://phabricator.wikimedia.org/T347086)
[14:03:08] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-09-24-221243 to 2024-10-08-175510 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078950 (https://phabricator.wikimedia.org/T347086)
[14:03:10] <wikibugs>	 (03PS3) 10Kosta Harlan: dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726)
[14:03:13] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Enable Wikidata dereferencing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078951 (https://phabricator.wikimedia.org/T370072)
[14:03:14] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[14:03:16] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] dumps: Stop running the dump_global_blocks job [puppet] - 10https://gerrit.wikimedia.org/r/1078913 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan)
[14:03:30] <wikibugs>	 (03PS1) 10Klausman: aptrepo: Add more missing packages to the rocm61 import, pt3 [puppet] - 10https://gerrit.wikimedia.org/r/1078952 (https://phabricator.wikimedia.org/T37638)
[14:03:55] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp2004.wikimedia.org
[14:04:20] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[14:04:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm
[14:05:31] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-09-24-145528 to 2024-10-08-175830 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078949 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester)
[14:05:49] <wikibugs>	 (03CR) 10Klausman: [C:03+2] aptrepo: Add more missing packages to the rocm61 import, pt3 [puppet] - 10https://gerrit.wikimedia.org/r/1078952 (https://phabricator.wikimedia.org/T37638) (owner: 10Klausman)
[14:05:58] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:06:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet
[14:06:20] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719)
[14:06:25] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-09-24-145528 to 2024-10-08-175830 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078949 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester)
[14:06:38] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10214438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm
[14:06:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: team-wmcs: add kernel panic alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez)
[14:06:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:06:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:06:58] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org
[14:07:34] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10214450 (10elukey) This is the current error:  ` Applying Network changes to the BMC. Error while configuring BIOS or mgmt interface: PATCH https://10...
[14:07:54] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2004.wikimedia.org
[14:07:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet
[14:07:59] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:08:51] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:08:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002"
[14:08:58] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002"
[14:08:58] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:09:17] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078948 (https://phabricator.wikimedia.org/T376808) (owner: 10Muehlenhoff)
[14:09:19] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "LGTM, we can refine further when we have some real-world examples." [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez)
[14:09:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet
[14:09:26] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:09:45] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:10:01] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:11:03] <moritzm>	 !log installing Apache security updates
[14:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:08] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "TIL profile::base::production::role_description" [puppet] - 10https://gerrit.wikimedia.org/r/1078942 (owner: 10Muehlenhoff)
[14:11:22] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:11:27] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:11:38] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Open a new WMA-core mailing list - https://phabricator.wikimedia.org/T37638#10214465 (10Ladsgroup) @klausman Hi, are you sure the ticket you're connecting your patches to is the correct one? First patch was fine, but this is the second patch.
[14:11:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2033.codfw.wmnet
[14:12:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet
[14:12:23] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:12:37] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Open a new WMA-core mailing list - https://phabricator.wikimedia.org/T37638#10214477 (10klausman) >>! In T37638#10214464, @Ladsgroup wrote: > @klausman Hi, are you sure the ticket you're connecting your patches to is the correct one? First patch was fin...
[14:13:08] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-09-24-221243 to 2024-10-08-175510 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078950 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester)
[14:13:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P69519 and previous config saved to /var/cache/conftool/dbconfig/20241009-141319-ladsgroup.json
[14:13:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2033.codfw.wmnet
[14:14:35] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-09-24-221243 to 2024-10-08-175510 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078950 (https://phabricator.wikimedia.org/T347086) (owner: 10Jforrester)
[14:14:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] team-wmcs: add kernel panic alerts [alerts] - 10https://gerrit.wikimedia.org/r/1078922 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez)
[14:14:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:16:06] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix /etc/issue for config-master [puppet] - https://gerrit.wikimedia.org/r/1078942 (owner: Muehlenhoff)
[14:16:50] <wikibugs>	 (PS1) Ssingh: P:ntp: increase check and retry interval [puppet] - https://gerrit.wikimedia.org/r/1078953
[14:17:10] <wikibugs>	 (PS2) Ssingh: P:ntp: increase check_interval [puppet] - https://gerrit.wikimedia.org/r/1078953
[14:17:28] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:18:18] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4258/console" [puppet] - https://gerrit.wikimedia.org/r/1078953 (owner: Ssingh)
[14:18:23] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:18:29] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and A:wikidough
[14:18:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002"
[14:18:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002"
[14:18:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:18:50] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:19:21] <wikibugs>	 (CR) Hnowlan: [C:+1] services: skip the kask's tls config in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: Elukey)
[14:19:52] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] P:ntp: increase check_interval [puppet] - https://gerrit.wikimedia.org/r/1078953 (owner: Ssingh)
[14:19:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[14:20:07] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[14:20:09] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:20:12] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[14:20:25] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10214504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed w...
[14:20:25] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:20:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[14:20:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[14:21:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev
[14:21:11] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb2004-dev
[14:21:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2033.codfw.wmnet
[14:21:22] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:21:45] <wikibugs>	 (PS4) Elukey: services: skip the kask's tls config in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766)
[14:21:47] <sukhe>	 !log sudo cumin 'O:alerting_host' 'run-puppet-agent'
[14:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:49] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719)
[14:21:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:21:58] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org
[14:22:04] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage
[14:22:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev
[14:23:01] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[14:23:04] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb2004-dev
[14:23:40] <moritzm>	 !log failover master for ganeti/routed to ganeti2033
[14:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:44] <wikibugs>	 (CR) CI reject: [V:-1] wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[14:24:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P69520 and previous config saved to /var/cache/conftool/dbconfig/20241009-142404-ladsgroup.json
[14:24:07] <stashbot>	 T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652
[14:24:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10214546 (phaultfinder)
[14:24:58] <wikibugs>	 (CR) Scott French: [C:+1] "Yeah, I believe this is the right way to go, especially for non-staging where we need to leave certs.cassandra intact (which is why I was " [deployment-charts] - https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: Elukey)
[14:25:19] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10214548 (elukey) I checked the firmware version of the BMC and I got: `'Oem': {'Supermicro': {'UniqueFilename': 'BMC_X12AST2600-ROT-5201MS_20221105_...
[14:25:22] <wikibugs>	 (PS1) JMeybohm: Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - https://gerrit.wikimedia.org/r/1078957 (https://phabricator.wikimedia.org/T376795)
[14:25:39] <wikibugs>	 (CR) Elukey: [C:+2] services: skip the kask's tls config in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1078944 (https://phabricator.wikimedia.org/T376766) (owner: Elukey)
[14:27:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage
[14:28:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T367856)', diff saved to https://phabricator.wikimedia.org/P69521 and previous config saved to /var/cache/conftool/dbconfig/20241009-142826-ladsgroup.json
[14:28:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[14:28:29] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[14:28:42] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[14:28:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T367856)', diff saved to https://phabricator.wikimedia.org/P69522 and previous config saved to /var/cache/conftool/dbconfig/20241009-142848-ladsgroup.json
[14:29:15] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db1198.eqiad.wmnet onto db1157.eqiad.wmnet
[14:30:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:30:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet
[14:31:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:31:26] <sukhe>	 ^ reboots, expected
[14:31:44] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[14:31:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[14:32:22] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:33:48] <wikibugs>	 (CR) JMeybohm: [C:+2] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - https://gerrit.wikimedia.org/r/1078957 (https://phabricator.wikimedia.org/T376795) (owner: JMeybohm)
[14:34:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet
[14:34:34] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719)
[14:34:39] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: Arturo Borrero Gonzalez)
[14:35:13] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org
[14:36:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet
[14:39:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet
[14:39:44] <wikibugs>	 (PS5) Herron: add links to SLOs migrated to pyrra [grafana-grizzly] - https://gerrit.wikimedia.org/r/1077966 (https://phabricator.wikimedia.org/T302995)
[14:39:54] <wikibugs>	 (CR) Herron: [V:+2 C:+2] add links to SLOs migrated to pyrra [grafana-grizzly] - https://gerrit.wikimedia.org/r/1077966 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[14:42:13] <wikibugs>	 (PS1) Elukey: sre.hosts.provision: warn when the BMC firmware is old for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372)
[14:42:28] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:39] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Overall LGTM.  One nit inline, not 100% essential.  I'll work on setting the cloudsw side up to match the addresses used here." [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: Arturo Borrero Gonzalez)
[14:43:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev
[14:43:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudlb2004-dev
[14:43:42] <wikibugs>	 (PS1) Muehlenhoff: Add jtweed to LDAP users [puppet] - https://gerrit.wikimedia.org/r/1078962 (https://phabricator.wikimedia.org/T376777)
[14:44:05] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:44:25] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:44:41] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:44:53] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:44:55] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum
[14:44:58] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:45:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add jtweed to LDAP users [puppet] - https://gerrit.wikimedia.org/r/1078962 (https://phabricator.wikimedia.org/T376777) (owner: Muehlenhoff)
[14:45:36] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:45:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[14:46:43] <wikibugs>	 (PS2) Elukey: sre.hosts.provision: warn when the BMC firmware is old for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372)
[14:47:05] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:47:24] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:47:29] <wikibugs>	 (PS1) Scott French: echostore: pilot service mesh support in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766)
[14:47:30] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling restart_daemons on P{cephosd1001*} and (A:cephosd)
[14:47:43] <wikibugs>	 (CR) Elukey: "Tested on backup1012, worked nicely." [cookbooks] - https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[14:47:51] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling restart_daemons on P{cephosd1001*} and (A:cephosd)
[14:49:43] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777#10214623 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Hi Jonathan, I've added you to the cn=wmf LDAP group. You should be able to access the service...
[14:50:13] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org
[14:50:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:50:37] <wikibugs>	 (CR) Scott French: "Diffs look like what I'd expect :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[14:50:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet
[14:51:07] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:51:21] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:51:35] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:51:47] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:52:00] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:52:53] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on backup[2010-2011].codfw.wmnet with reason: T376800
[14:53:07] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on backup[2010-2011].codfw.wmnet with reason: T376800
[14:53:12] <wikibugs>	 (CR) Volans: sre.hosts.provision: warn when the BMC firmware is old for Supermicro (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[14:54:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet
[14:54:34] <wikibugs>	 (CR) Hnowlan: [C:+1] echostore: pilot service mesh support in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[14:55:20] <wikibugs>	 SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10214641 (elukey) Last issue worth to report is T371416#10214548. The backup1012 host seems to have a very old fir...
[14:57:51] <wikibugs>	 (CR) Elukey: [C:+1] "In sessionstore I added also:" [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[14:58:57] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling restart_daemons on P{cephosd1001*} and (A:cephosd)
[14:59:21] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling restart_daemons on P{cephosd1001*} and (A:cephosd)
[15:00:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host crm2001.codfw.wmnet
[15:01:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:36] <wikibugs>	 (CR) Elukey: sre.hosts.provision: warn when the BMC firmware is old for Supermicro (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[15:01:48] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED
[15:02:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-eqsin with reason: router replacement
[15:03:04] <wikibugs>	 (PS1) Klausman: modules/admin: add ml-lab-users to render group [puppet] - https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380)
[15:03:04] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-eqsin with reason: router replacement
[15:03:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-eqsin with reason: router replacement
[15:04:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqsin with reason: router replacement
[15:04:19] <wikibugs>	 (PS5) Brouberol: wip: ceph rolling restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071)
[15:04:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host crm2001.codfw.wmnet
[15:04:39] <wikibugs>	 (Abandoned) Hnowlan: trafficserver: route knowledge-gap path via rest-gateway [puppet] - https://gerrit.wikimedia.org/r/946928 (https://phabricator.wikimedia.org/T342213) (owner: Hnowlan)
[15:04:47] <wikibugs>	 (PS2) Klausman: modules/admin: add ml-lab-users to render group [puppet] - https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380)
[15:05:21] <wikibugs>	 (PS2) Scott French: echostore: pilot service mesh support in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766)
[15:05:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet
[15:06:40] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org
[15:06:58] <wikibugs>	 (CR) Scott French: "Excellent catch, Luca. So, it turns out it this _was_ using 8082 for both, which would have encountered the same bind error you observed. " [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[15:07:03] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns1006.wikimedia.org
[15:07:04] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns1006.wikimedia.org
[15:07:04] <wikibugs>	 (CR) Elukey: "Please ping Infrastructure Foundations before adding new Posix groups :)" [puppet] - https://gerrit.wikimedia.org/r/1077914 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[15:07:07] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777#10214673 (Aklapper) @MoritzMuehlenhoff: Per the steps on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_access , please also add new `ldap/wmf`...
[15:09:04] <wikibugs>	 (PS1) Cwhite: opensearch: gate curator install [puppet] - https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429)
[15:09:10] <wikibugs>	 (CR) Elukey: "Technically this adds a SUDO rule so IIRC it should wait for the Infrastructure Foundations meeting that happens every Monday, in practice" [puppet] - https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[15:09:23] <wikibugs>	 (CR) Xcollazo: [C:+1] dse-k8s-services: content_history: version bump image. [deployment-charts] - https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena)
[15:09:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet
[15:09:59] <mutante>	 !log people.wikimedia.org - rebooting backends
[15:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:58] <wikibugs>	 (CR) CI reject: [V:-1] opensearch: gate curator install [puppet] - https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429) (owner: Cwhite)
[15:11:40] <wikibugs>	 (PS6) Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071)
[15:12:30] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10214692 (ssingh) @MSantos: Are you still the person responsible for approving these requests? If yes, this needs your approval. If not, apologies for adding you and please feel free to remove...
[15:16:29] <wikibugs>	 (PS10) Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716)
[15:16:39] <wikibugs>	 (PS1) Muehlenhoff: Revert "IDP: Failover IDP service to eqiad." [dns] - https://gerrit.wikimedia.org/r/1078971
[15:16:45] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] Bump cert-manager, cfssl-issuer and helm-state metrics resources [deployment-charts] - https://gerrit.wikimedia.org/r/1078957 (https://phabricator.wikimedia.org/T376795) (owner: JMeybohm)
[15:16:50] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cloudgw: add IPv6 support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: Arturo Borrero Gonzalez)
[15:16:52] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: Arturo Borrero Gonzalez)
[15:17:09] <mutante>	 !log planet.wikimedia.org - rebooting backends
[15:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:18] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10214744 (Papaul) @ayounsi thanks for the feedback
[15:17:35] <wikibugs>	 (CR) Arnaudb: [C:+1] Revert "IDP: Failover IDP service to eqiad." [dns] - https://gerrit.wikimedia.org/r/1078971 (owner: Muehlenhoff)
[15:17:51] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "IDP: Failover IDP service to eqiad." [dns] - https://gerrit.wikimedia.org/r/1078971 (owner: Muehlenhoff)
[15:17:53] <wikibugs>	 (CR) JHathaway: [C:+1] Revert "IDP: Failover IDP service to eqiad." [dns] - https://gerrit.wikimedia.org/r/1078971 (owner: Muehlenhoff)
[15:17:56] <wikibugs>	 (CR) Dzahn: [C:+1] Revert "IDP: Failover IDP service to eqiad." [dns] - https://gerrit.wikimedia.org/r/1078971 (owner: Muehlenhoff)
[15:17:56] <wikibugs>	 (CR) Jcrespo: [C:+1] Revert "IDP: Failover IDP service to eqiad." [dns] - https://gerrit.wikimedia.org/r/1078971 (owner: Muehlenhoff)
[15:17:57] <wikibugs>	 (PS1) Cathal Mooney: Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462)
[15:19:05] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10214754 (Papaul) @cmooney thanks for the feedback for the migration let us work with the way it is setup for know and we can look into all t...
[15:19:12] <wikibugs>	 (CR) CI reject: [V:-1] Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) (owner: Cathal Mooney)
[15:19:17] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:19:58] <wikibugs>	 (CR) Dzahn: "any ticket related to this?" [dns] - https://gerrit.wikimedia.org/r/1078946 (owner: Slyngshede)
[15:20:53] <sukhe>	 !log running dummy authdns-update
[15:20:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:40] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org
[15:22:12] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPv6 reverse entry for cloudsw1-b1-codfw interface IPs - cmooney@cumin1002"
[15:22:36] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPv6 reverse entry for cloudsw1-b1-codfw interface IPs - cmooney@cumin1002"
[15:22:36] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:23:27] <wikibugs>	 (PS2) Cathal Mooney: Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462)
[15:23:34] <mutante>	 !log stewards* - rebooting machines - T351202
[15:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:37] <stashbot>	 T351202: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202
[15:24:04] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: eqsin cr replacementAA, T375961]
[15:24:29] <logmsgbot>	 !log fabfur@cumin1002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: depool site eqsin [reason: eqsin cr replacementAA, T375961]
[15:24:35] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: eqsin cr replacement, T375961]
[15:24:38] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: eqsin cr replacement, T375961]
[15:25:12] <wikibugs>	 (CR) Cathal Mooney: [C:+1] cloudgw: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: Arturo Borrero Gonzalez)
[15:25:13] <fabfur>	 !log eqsin depooled for T375961
[15:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:01] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache idp.wikimedia.org on all recursors
[15:26:05] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp.wikimedia.org on all recursors
[15:26:25] <wikibugs>	 (CR) Volans: "I know is a WIP, but given our chat on IRC I got curious and left some comments." [cookbooks] - https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: Brouberol)
[15:26:42] <wikibugs>	 (PS3) Klausman: modules/admin: add ml-lab-users to render group [puppet] - https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380)
[15:27:22] <wikibugs>	 (CR) Klausman: "I have also tested just now: radeontop already works fine without elevated privileges. That is, once you're in the render group, you don't" [puppet] - https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[15:28:25] <wikibugs>	 (PS4) Klausman: modules/admin: add ml-lab-users to render group [puppet] - https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380)
[15:30:25] <wikibugs>	 (PS1) Bking: stat hosts: enable zRAM-based swap [puppet] - https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813)
[15:30:31] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org
[15:30:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[15:31:02] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[15:33:45] <wikibugs>	 (03PS2) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813)
[15:34:15] <wikibugs>	 (03CR) 10Scott French: "Alright, looking at the diffs once overriding `app.port`, there's a gotcha here: the prometheus port annotation is now incorrect - i.e., `" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[15:34:26] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[15:38:43] <wikibugs>	 (03PS3) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813)
[15:39:04] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[15:40:10] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Compared against Netbox and did the v6 PTR fun. Looks good 😊" [dns] - 10https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) (owner: 10Cathal Mooney)
[15:40:43] <wikibugs>	 (03PS2) 10Cwhite: opensearch: gate curator install [puppet] - 10https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429)
[15:41:47] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDEs for newly-assigned IPv6 networks WMCS Codfw [dns] - 10https://gerrit.wikimedia.org/r/1078972 (https://phabricator.wikimedia.org/T376462) (owner: 10Cathal Mooney)
[15:43:20] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and A:wikidough
[15:43:35] <wikibugs>	 (03PS7) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071)
[15:43:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:43:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:43:54] <jinxer-wm>	 FIRING: [2x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs5005 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[15:43:56] <wikibugs>	 (03CR) 10Brouberol: Define a ceph rolling restart/reboot cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol)
[15:44:01] <sukhe>	 oh ok
[15:44:03] <sukhe>	 good that it paged
[15:44:04] <sukhe>	 expected
[15:44:08] <sukhe>	 !incidents
[15:44:08] <sirenbot>	 5300 (RESOLVED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[15:44:08] <sirenbot>	 5302 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[15:44:09] <sirenbot>	 5304 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[15:44:09] <sirenbot>	 5306 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[15:44:09] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[15:44:09] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[15:44:09] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[15:44:10] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[15:44:18] <sukhe>	 !incidents
[15:44:18] <sirenbot>	 5300 (RESOLVED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[15:44:18] <sirenbot>	 5302 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[15:44:18] <sirenbot>	 5304 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[15:44:19] <sirenbot>	 5306 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[15:44:19] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[15:44:19] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[15:44:19] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[15:44:20] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[15:44:24] <sukhe>	 hmm
[15:44:26] <swfrench-wmf>	 thanks, sukhe!
[15:44:28] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reboot-single for host mx-in1001.wikimedia.org
[15:44:31] <swfrench-wmf>	 hmmm ... that's odd
[15:44:37] <sukhe>	 yeah
[15:44:41] <claime>	 hasn't percolated to VO yet?
[15:44:52] <swfrench-wmf>	 my pager still has not gone off, so no?
[15:45:28] <sukhe>	 that's quite the lag
[15:45:31] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org
[15:45:54] <wikibugs>	 (03PS8) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071)
[15:46:53] <volans>	 VO is still green to me
[15:47:03] <sukhe>	 same here, and also on the web interface
[15:47:12] <sukhe>	 this is not a false positive, it's an actual alert. so it's not even that
[15:47:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:48:18] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx-in1001.wikimedia.org
[15:48:23] <sukhe>	 !incidents
[15:48:24] <sirenbot>	 5300 (RESOLVED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[15:48:24] <sirenbot>	 5302 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[15:48:25] <sirenbot>	 5304 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[15:48:25] <sirenbot>	 5306 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[15:48:25] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[15:48:25] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[15:48:25] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[15:48:26] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[15:48:27] <sukhe>	 fun
[15:48:32] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reboot-single for host mx-in2001.wikimedia.org
[15:48:39] <sukhe>	 so yeah, I guess a ticket is in order
[15:48:54] <jinxer-wm>	 FIRING: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs5004 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[15:49:04] <sukhe>	 I am going to downtime this in the meanwhile
[15:49:30] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs[5004-5006].eqsin.wmnet with reason: site is depooled, cr2-eqsin is being replaced
[15:49:45] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs[5004-5006].eqsin.wmnet with reason: site is depooled, cr2-eqsin is being replaced
[15:51:18] <sukhe>	 so yeah, no sign of this page on VO anywhere
[15:51:59] <volans>	 karma is not logging errors
[15:52:31] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx-in2001.wikimedia.org
[15:52:38] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:53:17] <sukhe>	 !log running authdns-update
[15:53:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:54:34] <logmsgbot>	 !log sukhe@cumin1002 END (ERROR) - Cookbook sre.dns.roll-reboot (exit_code=97) rolling reboot on A:dnsbox
[15:55:04] <sukhe>	 ^ resolving issues with authdns-update
[15:57:11] <volans>	 VO status page is green (see -private for a related thing though :D )
[15:58:00] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns2005.wikimedia.org
[15:58:00] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns2005.wikimedia.org
[16:00:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:01:30] <wikibugs>	 (03CR) 10Btullis: stat hosts: enable zRAM-based swap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[16:02:51] <wikibugs>	 (03CR) 10Btullis: stat hosts: enable zRAM-based swap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[16:03:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-misc2001.codfw.wmnet with OS bookworm
[16:03:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-misc2002.codfw.wmnet with OS bookworm
[16:03:19] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10214917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm
[16:03:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10214918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm
[16:04:12] <wikibugs>	 (03CR) 10Cathal Mooney: "Agreed this looks good to proceed.  I'll  wait until someone from Traffic can give the nod however." [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney)
[16:04:21] <wikibugs>	 (03PS4) 10Cathal Mooney: Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715)
[16:04:54] <wikibugs>	 (03CR) 10Scott French: "Looking at the current state of sessionstore prom metrics in staging, specifically at `up{kubernetes_namespace="sessionstore", prometheus=" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[16:05:09] <wikibugs>	 (03CR) 10Volans: Define a ceph rolling restart/reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol)
[16:05:25] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.netbox
[16:07:39] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:20:24] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.netbox
[16:21:28] <sukhe>	 !log forcing commit 95858bae44a2ccae5e7fb1fe793cd3bbc7ed9c6b through sre.dns.netbox
[16:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:55] <wikibugs>	 (03PS4) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813)
[16:23:06] <wikibugs>	 (03CR) 10Bking: stat hosts: enable zRAM-based swap (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[16:23:13] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[16:23:27] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: picking up zone file 1.0.e.f.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa - sukhe@cumin1002"
[16:23:32] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: picking up zone file 1.0.e.f.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa - sukhe@cumin1002"
[16:23:32] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:23:57] <sukhe>	 !log running authdns-update to fix broken zone files on dns2004
[16:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:49] <wikibugs>	 (03PS3) 10Scott French: echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766)
[16:25:49] <wikibugs>	 (03PS1) 10Scott French: kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766)
[16:30:30] <wikibugs>	 (03CR) 10Scott French: "Alright, I think this should do the job. I'll merge this and update sessionstore staging to confirm metrics come back, then merge the next" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[16:30:30] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10214998 (10cmooney) The delegations for the 4 subnets used so far on the infra-side are working also: ` cmooney@cumin1002:~$...
[16:31:16] <wikibugs>	 (03CR) 10Scott French: "Alright, I think I have a solution, now stacked below this patch." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[16:32:01] <bvibber>	 !log starting requeueTranscodes on old school mwmaint2002 after the k8s blowup last night
[16:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:25] <wikibugs>	 (03PS5) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813)
[16:32:49] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking)
[16:32:58] <wikibugs>	 (03PS2) 10Scott French: kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766)
[16:32:59] <wikibugs>	 (03PS4) 10Scott French: echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766)
[16:34:13] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[16:41:24] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1198.eqiad.wmnet onto db1157.eqiad.wmnet
[16:44:38] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for codfw cr IPs facin cloudsw - cmooney@cumin1002"
[16:44:42] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for codfw cr IPs facin cloudsw - cmooney@cumin1002"
[16:44:42] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:44:46] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Bump activeDeadlineSeconds to 24 hours [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078982 (https://phabricator.wikimedia.org/T374414)
[16:46:04] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+2] ipoid: Bump activeDeadlineSeconds to 24 hours [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078982 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan)
[16:47:06] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Bump activeDeadlineSeconds to 24 hours [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078982 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan)
[16:48:09] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: python_deploy::venv: transform into a define [puppet] - 10https://gerrit.wikimedia.org/r/1078707
[16:48:09] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782)
[16:48:09] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782)
[16:48:09] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: hiddenparma: add to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1078983 (https://phabricator.wikimedia.org/T371782)
[16:48:11] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: acme_chief: add SAN for requestctl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1078984 (https://phabricator.wikimedia.org/T371782)
[16:48:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::alerting_host: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782)
[16:48:26] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[16:48:29] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[16:50:05] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[16:50:09] <logmsgbot>	 !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[16:50:12] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[16:50:24] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[16:50:46] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[16:50:48] <logmsgbot>	 !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[16:54:46] <wikibugs>	 (03PS1) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066)
[16:55:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro)
[16:58:37] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Not 100% straightforward to read but it does the job, I'd test it to see how it works!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[16:58:58] <wikibugs>	 (03PS2) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066)
[16:59:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro)
[16:59:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69523 and previous config saved to /var/cache/conftool/dbconfig/20241009-165944-ladsgroup.json
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T1700)
[17:00:47] <wikibugs>	 (03PS3) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066)
[17:01:02] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Had a chat with Riccardo on meet, we are going to proceed with this one for the moment." [cookbooks] - 10https://gerrit.wikimedia.org/r/1078961 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[17:02:00] <wikibugs>	 (03PS4) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066)
[17:03:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro)
[17:04:22] <wikibugs>	 (03PS5) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066)
[17:04:29] <swfrench-wmf>	 !incidents
[17:04:29] <sirenbot>	 5307 (UNACKED)  Host cr2-eqsin - PING  - Packet loss = 100%
[17:04:29] <sirenbot>	 5300 (RESOLVED)  Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre)
[17:04:30] <sirenbot>	 5302 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin)
[17:04:30] <sirenbot>	 5304 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin)
[17:04:30] <sirenbot>	 5306 (RESOLVED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[17:04:30] <sirenbot>	 5305 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[17:04:30] <sirenbot>	 5303 (RESOLVED)  ProbeDown sre (ip4 probes/service codfw)
[17:04:31] <sirenbot>	 5301 (RESOLVED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[17:04:31] <sirenbot>	 5299 (RESOLVED)  GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw)
[17:04:42] <swfrench-wmf>	 !ack 5307
[17:04:43] <sirenbot>	 5307 (ACKED)  Host cr2-eqsin - PING  - Packet loss = 100%
[17:04:47] <volans>	 what's happening to alerts today?
[17:04:50] <swfrench-wmf>	 sukhe: expected?
[17:04:58] <swfrench-wmf>	 or is this just wildly old
[17:04:58] <volans>	 earlier no VO alert, now we got the alert via email but not here
[17:05:24] <wikibugs>	 (03PS6) 10David Caro: p:toolforge::proxy: add toolforge api site config [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066)
[17:07:04] <sukhe>	 swfrench-wmf: expected yep
[17:07:09] <sukhe>	 site is depooled
[17:07:29] <sukhe>	 but this was downtimed 
[17:07:37] <swfrench-wmf>	 I think it might have expired?
[17:07:59] <sukhe>	 yep I thought it was four but was two hours: https://phabricator.wikimedia.org/T375961#10214665
[17:08:00] <swfrench-wmf>	 I see the downtime was added at 15:04 UTC
[17:08:34] <sukhe>	 I think we can ACK it but it will be up soon so might be worthwhile to not downtime it again
[17:08:41] <sukhe>	 papaul and robh are working on it
[17:08:55] <wikibugs>	 (03PS1) 10RLazarus: deployment_server: Add `helm list` pagination to mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078988 (https://phabricator.wikimedia.org/T376795)
[17:08:57] <swfrench-wmf>	 sounds good - thanks!
[17:09:01] <sukhe>	 thanks and sorry for the noise
[17:09:28] <sukhe>	 on that note, I will file a task for why we didn't get paged for the pybal alert
[17:12:27] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org
[17:12:28] <logmsgbot>	 !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org
[17:13:02] <wikibugs>	 (03PS3) 10JHathaway: vrts_aliases: add a basic safeguard, improve existing safeguards [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090)
[17:13:09] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert1002.wikimedia.org
[17:13:22] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "Tested in toolsbeta: https://api.beta.toolforge.org" [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro)
[17:13:47] <wikibugs>	 (03CR) 10JHathaway: "Getting back to this, cut another patch with a few improvements, please take a look, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway)
[17:14:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vrts_aliases: add a basic safeguard, improve existing safeguards [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: 10JHathaway)
[17:15:12] <wikibugs>	 (03CR) 10Scott French: [C:03+1] deployment_server: Add `helm list` pagination to mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078988 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus)
[17:15:52] <wikibugs>	 (03PS4) 10JHathaway: vrts_aliases: add a basic safeguard, improve existing safeguards [puppet] - 10https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090)
[17:17:01] <wikibugs>	 (03PS1) 10Cathal Mooney: Add elements for WMCS IPv6 range in codfw  2a02:ec80:a100::/48 [homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495)
[17:18:02] <rzl>	 bvibber: I'm deleting the artifacts from those video transcode jobs on kubernetes -- just FYI the logs won't be around for seven days as promised :) is that okay or do you need to collect anything first?
[17:20:09] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] deployment_server: Add `helm list` pagination to mwscript-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078988 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus)
[17:21:41] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host alert1002.wikimedia.org
[17:23:00] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org
[17:23:01] <logmsgbot>	 !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org
[17:23:10] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host alert1002.wikimedia.org
[17:23:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm executed with errors: - mc-misc20...
[17:23:43] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm executed with errors: - mc-misc20...
[17:26:44] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[17:27:34] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] echostore: pilot service mesh support in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[17:29:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69525 and previous config saved to /var/cache/conftool/dbconfig/20241009-172956-ladsgroup.json
[17:31:42] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host alert1002.wikimedia.org
[17:32:07] <icinga-wm>	 PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[17:32:20] <sukhe>	 ^ expected
[17:33:13] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:34:16] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet
[17:34:28] <jinxer-wm>	 RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:35:24] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[17:35:28] <wikibugs>	 (03CR) 10Scott French: [C:03+2] kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[17:36:30] <wikibugs>	 (03Merged) 10jenkins-bot: kask: open app.port if mesh is enabled on another port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078979 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French)
[17:37:49] <wikibugs>	 (03Abandoned) 10Andrea Denisse: alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1064828 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[17:38:17] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet
[17:40:58] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
[17:41:36] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[17:41:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[17:42:29] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "No concerns I can see: PTRs look good and I don't think there is any concern with the delegation (also not the first one to ns[01].opensta" [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney)
[17:44:56] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
[17:45:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69526 and previous config saved to /var/cache/conftool/dbconfig/20241009-174501-ladsgroup.json
[17:45:55] <wikibugs>	 (CR) Scott French: "Thank you both!" [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[17:45:57] <wikibugs>	 (CR) Scott French: [C:+2] echostore: pilot service mesh support in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[17:46:40] <bvibber>	 rzl: that's fine, i don't need the output :D
[17:46:56] <rzl>	 thanks!
[17:47:00] <wikibugs>	 (Merged) jenkins-bot: echostore: pilot service mesh support in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1078964 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[17:47:16] <bvibber>	 luv 2 break the site overnight \o/
[17:48:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:49:44] <wikibugs>	 (PS1) Ssingh: varnish: add pediapress.com to allowed maps domains [puppet] - https://gerrit.wikimedia.org/r/1078994
[17:50:27] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4261/co" [puppet] - https://gerrit.wikimedia.org/r/1078994 (owner: Ssingh)
[17:50:51] <wikibugs>	 (PS2) Ssingh: varnish: add pediapress.com to allowed maps domains [puppet] - https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761)
[17:51:33] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply
[17:51:44] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4262/co" [puppet] - https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761) (owner: Ssingh)
[17:51:46] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[17:53:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-misc2001.codfw.wmnet with OS bookworm
[17:53:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-misc2002.codfw.wmnet with OS bookworm
[17:53:59] <wikibugs>	 (CR) Ssingh: [V:+1] "This should not be merged until the request has been approved in the ticket above but the patch exists." [puppet] - https://gerrit.wikimedia.org/r/1078994 (https://phabricator.wikimedia.org/T375761) (owner: Ssingh)
[17:54:00] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215245 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm
[17:54:02] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215246 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm
[17:58:19] <zabe>	 !log zabe@mwmaint2002:~$ cat /home/zabe/s5.txt | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php {} --skip /home/zabe/text_table_cleanup/{} --dump /home/zabe/text_table_dump/{} --sleep 1" # T183490
[17:58:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:22] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[17:58:28] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:01:44] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply
[18:01:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[18:03:08] <icinga-wm>	 RECOVERY - Host cr2-eqsin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 222.67 ms
[18:03:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:04:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release echostore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=echostore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:06:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-misc2001.codfw.wmnet with reason: host reimage
[18:07:50] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[18:08:58] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet
[18:09:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-misc2001.codfw.wmnet with reason: host reimage
[18:10:49] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:10:53] <icinga-wm>	 RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:10:56] <icinga-wm>	 RECOVERY - Host cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.56 ms
[18:11:04] <sukhe>	 yayay
[18:11:23] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:11:27] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:11:46] <swfrench-wmf>	 nice! :)
[18:12:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-misc2002.codfw.wmnet with reason: host reimage
[18:13:33] <icinga-wm>	 RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 223.83 ms
[18:13:57] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:14:45] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:14:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-misc2002.codfw.wmnet with reason: host reimage
[18:15:08] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet
[18:15:19] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:15:32] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs[5004-5006].eqsin.wmnet
[18:15:34] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs[5004-5006].eqsin.wmnet
[18:15:57] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:16:01] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet
[18:16:57] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet
[18:18:20] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet
[18:18:21] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:18:59] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:21:01] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:23:19] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:24:30] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:eqsin and A:dnsbox
[18:24:30] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org
[18:24:45] <wikibugs>	 (PS1) Scott French: Revert "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079000 (https://phabricator.wikimedia.org/T376766)
[18:25:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10215329 (phaultfinder)
[18:26:05] <wikibugs>	 (CR) Scott French: [C:+2] Revert "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079000 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[18:26:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[18:26:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:26:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P69527 and previous config saved to /var/cache/conftool/dbconfig/20241009-182632-ladsgroup.json
[18:26:35] <stashbot>	 T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652
[18:26:43] <logmsgbot>	 !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus5002.eqsin.wmnet
[18:26:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db1198.eqiad.wmnet onto db1212.eqiad.wmnet
[18:27:03] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:27:12] <wikibugs>	 (Merged) jenkins-bot: Revert "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079000 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[18:28:05] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:28:15] <sukhe>	 ^ expected
[18:28:18] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet
[18:28:43] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[18:29:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply
[18:29:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[18:29:47] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s3 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1212.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1212.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:33:43] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[18:33:45] <wikibugs>	 (CR) RLazarus: [C:+2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) (owner: Lucas Werkmeister (WMDE))
[18:33:45] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns5003 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:34:07] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:34:09] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release echostore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=echostore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:34:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[18:34:21] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet
[18:34:53] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org
[18:35:28] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet
[18:35:38] <icinga-wm>	 PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:37:01] <sukhe>	 this one is interesting
[18:37:53] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:38:07] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:38:11] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:38:13] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:38:50] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet
[18:38:57] <icinga-wm>	 PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:38:59] <icinga-wm>	 PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:39:42] <icinga-wm>	 ACKNOWLEDGEMENT - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3003 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). Sukhbir Singh ntpsec needs a restarts - The acknowledgement expires at: 2024-10-10 15:00:00. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:39:42] <icinga-wm>	 ACKNOWLEDGEMENT - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). Sukhbir Singh ntpsec needs a restarts - The acknowledgement expires at: 2024-10-10 15:00:00. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:39:42] <icinga-wm>	 ACKNOWLEDGEMENT - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). Sukhbir Singh ntpsec needs a restarts - The acknowledgement expires at: 2024-10-10 15:00:00. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[18:40:11] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:40:39] <sukhe>	 ^ the above is the alert1001 and 2001 removals. will issue a restart later
[18:41:36] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet
[18:45:20] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet
[18:47:45] <jinxer-wm>	 FIRING: ProbeDown: Service install5002:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:49:53] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org
[18:51:15] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:51:26] <jinxer-wm>	 RESOLVED: ProbeDown: Service install5002:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:52:30] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:54:15] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:00:15] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:01:03] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns5004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[19:01:17] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:04:23] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org
[19:04:23] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:eqsin and A:dnsbox
[19:04:40] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and A:dnsbox
[19:04:40] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
[19:05:19] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:08:21] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:08:27] <icinga-wm>	 PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:08:31] <sukhe>	 ^ expected
[19:08:45] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:09:19] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:12:21] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:12:49] <wikibugs>	 ops-eqsin: Inbound interface errors - https://phabricator.wikimedia.org/T376837 (phaultfinder) NEW
[19:13:27] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:13:45] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:13:55] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7001 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[19:14:19] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:15:21] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:16:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:17:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:20:23] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
[19:20:25] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:21:29] <sukhe>	 ^ this is doh5001 flapping. I will wait for maintenance to settle down and then we can check this
[19:24:25] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:27:25] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:27:30] <logmsgbot>	 !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[19:27:43] <logmsgbot>	 !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[19:27:54] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[19:27:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-misc2001.codfw.wmnet with OS bookworm
[19:28:02] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[19:28:03] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-misc2002.codfw.wmnet with OS bookworm
[19:28:07] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2001.codfw.wmnet with OS bookworm completed: - mc-misc2001 (**PASS*...
[19:28:09] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215505 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-misc2002.codfw.wmnet with OS bookworm completed: - mc-misc2002 (**WARN*...
[19:29:05] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215506 (Jhancock.wm)
[19:31:27] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:32:19] <wikibugs>	 (PS1) Scott French: Revert^2 "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079005 (https://phabricator.wikimedia.org/T376766)
[19:32:49] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10215511 (Jhancock.wm) Open→Resolved @jijiki this is ready for you.
[19:34:59] <logmsgbot>	 !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply
[19:35:18] <logmsgbot>	 !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply
[19:35:23] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
[19:36:23] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:37:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:38:11] <logmsgbot>	 !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply
[19:38:27] <logmsgbot>	 !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply
[19:38:29] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:38:53] <wikibugs>	 (PS1) Ladsgroup: mariadb: Add SLAVE MONITOR to promotheus grants [puppet] - https://gerrit.wikimedia.org/r/1079006
[19:39:05] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:39:15] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:39:23] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:43:18] <wikibugs>	 (CR) Ladsgroup: "This was what was missing for pc5, I'm not sure whether we should do it or not but documenting it at least." [puppet] - https://gerrit.wikimedia.org/r/1079006 (owner: Ladsgroup)
[19:43:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:44:51] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[19:45:05] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:45:15] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:45:29] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:46:00] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org
[19:46:00] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and A:dnsbox
[19:49:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:54:25] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:55:34] <swfrench-wmf>	 !log removing echostore staging deployment to unblock breaking change - T376766
[19:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:37] <stashbot>	 T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766
[19:55:44] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2006*} and A:dnsbox
[19:55:44] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org
[19:56:29] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:57:00] <sukhe>	 ^ there is sadly no way to silence this. I suspect the BFD session flapping is due to the really old version of Junos on the new cr2-eqsin cr
[19:57:09] <sukhe>	 so we will wait for the Junos upgrade and then revisit this
[19:57:25] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:59:05] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:59:09] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:59:17] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:59:25] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:01:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:02:29] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:03:05] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 369, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:03:09] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 287, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:03:17] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns2006 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:03:18] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:03:25] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:06:41] <wikibugs>	 (CR) Scott French: [C:+2] Revert^2 "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079005 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[20:07:39] <wikibugs>	 (Merged) jenkins-bot: Revert^2 "echostore: pilot service mesh support in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1079005 (https://phabricator.wikimedia.org/T376766) (owner: Scott French)
[20:08:22] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org
[20:08:22] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2006*} and A:dnsbox
[20:12:08] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply
[20:12:25] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[20:17:31] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and (A:esams or A:drmrs) and A:dnsbox
[20:17:31] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org
[20:21:27] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:21:45] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:24:47] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s3 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:27:27] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:27:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:27:37] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3003 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:27:45] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:28:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:32:41] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org
[20:33:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:34:53] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:35:07] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:35:11] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:35:13] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:35:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:38:40] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1198.eqiad.wmnet onto db1212.eqiad.wmnet
[20:41:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:42:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:44:41] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org
[20:46:23] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms
[20:48:11] <icinga-wm>	 PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:48:23] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:50:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:54:27] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:54:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:54:47] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns3004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:55:13] <icinga-wm>	 RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:55:23] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:56:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69528 and previous config saved to /var/cache/conftool/dbconfig/20241009-205601-ladsgroup.json
[20:56:07] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3004.wikimedia.org
[20:57:27] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:57:37] <icinga-wm>	 PROBLEM - NTP peers and stratum check on dns3004 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown, stratum=-1 (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP
[20:58:17] <icinga-wm>	 RECOVERY - NTP peers and stratum check on dns3004 is OK: NTP OK: Offset -0.000921227 secs, stratum=2 https://wikitech.wikimedia.org/wiki/NTP
[20:58:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241009T2100)
[21:01:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:04:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:07:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:08:07] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org
[21:10:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:11:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69529 and previous config saved to /var/cache/conftool/dbconfig/20241009-211107-ladsgroup.json
[21:12:05] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:12:07] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:15:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:16:53] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6001 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:17:05] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:17:07] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:18:03] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org
[21:18:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:19:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:22:45] <mutante>	 !log [apt1002:~] $ sudo -i reprepro --component thirdparty/gitlab-bullseye update bullseye-wikimedia
[21:22:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:26:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69530 and previous config saved to /var/cache/conftool/dbconfig/20241009-212612-ladsgroup.json
[21:28:38] <wikibugs>	 (PS3) Scott French: echostore: adopt service mesh in production [deployment-charts] - https://gerrit.wikimedia.org/r/1079012 (https://phabricator.wikimedia.org/T376766)
[21:30:03] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6002.wikimedia.org
[21:32:32] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:32:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:33:23] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:34:15] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:34:39] <wikibugs>	 (CR) Cwhite: [C:+2] opensearch: gate curator install [puppet] - https://gerrit.wikimedia.org/r/1078970 (https://phabricator.wikimedia.org/T362429) (owner: Cwhite)
[21:35:37] <icinga-wm>	 RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6002 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:37:20] <logmsgbot>	 dzahn@cumin2002 dzahn: The backup on gitlab1003 is complete, ready to proceed with upgrade.
[21:38:15] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:38:23] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:41:07] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:41:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69531 and previous config saved to /var/cache/conftool/dbconfig/20241009-214117-ladsgroup.json
[21:42:30] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:42:36] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:44:07] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:44:14] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:44:57] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:45:16] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org
[21:45:16] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and (A:esams or A:drmrs) and A:dnsbox
[21:45:42] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[21:47:56] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2
[21:48:41] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2
[21:51:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:53:13] <wikibugs>	 (PS1) Bking: data-platform: alert on load15 > 32 [alerts] - https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426)
[21:53:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:54:15] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2
[21:55:04] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009-2
[21:55:12] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "We've discussed lowering this threshold with the hardcoded value of 32 for now and then in a future patch making it actually look at the n" [alerts] - https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426) (owner: Bking)
[21:57:03] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: release 20241009-3
[21:57:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:57:48] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: release 20241009-3
[21:59:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:00:24] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: release 20241009-3
[22:01:44] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: release 20241009-3
[22:03:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:07:50] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[22:09:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:09:40] <wikibugs>	 (CR) Dzahn: "When using the cookbook today, it failed:" [cookbooks] - https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: Jelto)
[22:10:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:10:50] <wikibugs>	 (PS1) Dzahn: Revert "sre.gitlab.upgrade:  also use the service name for the downtime" [cookbooks] - https://gerrit.wikimedia.org/r/1079025
[22:10:57] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:11:11] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:11:27] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:11:55] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:12:06] <wikibugs>	 (CR) Dzahn: "reverting to see if it's gone - we need to upgrade" [cookbooks] - https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: Jelto)
[22:12:28] <wikibugs>	 (CR) EoghanGaffney: [C:+1] Revert "sre.gitlab.upgrade:  also use the service name for the downtime" [cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn)
[22:12:46] <wikibugs>	 (CR) Dzahn: [C:+2] "we need it" [cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn)
[22:13:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:14:11] <wikibugs>	 (CR) Bking: [C:+2] data-platform: alert on load15 > 32 [alerts] - https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426) (owner: Bking)
[22:15:21] <wikibugs>	 (Merged) jenkins-bot: data-platform: alert on load15 > 32 [alerts] - https://gerrit.wikimedia.org/r/1079021 (https://phabricator.wikimedia.org/T376426) (owner: Bking)
[22:20:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:21:29] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:24:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:25:24] <wikibugs>	 (PS1) Dzahn: gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804)
[22:25:57] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[22:26:20] <wikibugs>	 (PS2) Dzahn: gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804)
[22:26:52] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[22:28:13] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3
[22:28:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:28:43] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[22:28:58] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3
[22:29:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:29:35] <wikibugs>	 (PS3) Dzahn: gerrit: comment out creation of site dir in migration profile [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804)
[22:29:51] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "unexpectedly also did not fix it. now going here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079026" [puppet] - https://gerrit.wikimedia.org/r/1078759 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[22:30:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10215833 (phaultfinder)
[22:30:56] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3
[22:33:43] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[22:35:11] <wikibugs>	 (CR) Dzahn: [C:+2] "yea, after the revert was deployed the cookbook works again." [cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn)
[22:35:57] <wikibugs>	 (CR) Dzahn: [C:+2] "was:" [cookbooks] - https://gerrit.wikimedia.org/r/1079025 (owner: Dzahn)
[22:35:59] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: security release 20241009-3
[22:36:49] <wikibugs>	 (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1079026/4264/" [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[22:37:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:42:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:43:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:47:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:50:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P69532 and previous config saved to /var/cache/conftool/dbconfig/20241009-225055-ladsgroup.json
[22:50:58] <stashbot>	 T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652
[22:51:15] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db1198.eqiad.wmnet onto db1223.eqiad.wmnet
[22:51:55] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:51:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1012.eqiad.wmnet with OS bookworm
[22:52:08] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10215864 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm compl...
[22:52:29] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:52:30] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:53:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:55:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:56:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:58:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:01:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:02:28] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[23:05:29] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:06:26] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:06:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:07:30] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241009
[23:07:45] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:08:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:09:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:10:11] <wikibugs>	 (CR) Scott French: "Thanks for this! No objections to this in the abstract, but I do want to understand the underlying motivation a bit better." [cookbooks] - https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: Clément Goubert)
[23:13:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:18:27] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet
[23:18:29] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:19:06] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1079026/4264/" [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[23:20:31] <icinga-wm>	 PROBLEM - Host logging-hd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:21:00] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10215958 (Jclark-ctr)
[23:21:59] <icinga-wm>	 RECOVERY - Host logging-hd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[23:22:03] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10215955 (Jclark-ctr) Open→Resolved a:Marostegui→Jclark-ctr
[23:22:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:25:48] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet
[23:26:23] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
[23:27:13] <icinga-wm>	 PROBLEM - Host logging-hd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[23:27:35] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "noop on prod hosts confirmed. fixed puppet run on gerrit2003" [puppet] - https://gerrit.wikimedia.org/r/1079026 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[23:29:11] <wikibugs>	 (CR) Dzahn: "this is now unblocked since puppet is unbroken on the new machine and then installed rsync and other things. deploying tomorrow or soon th" [puppet] - https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[23:29:19] <icinga-wm>	 RECOVERY - Host logging-hd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms
[23:31:29] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:34:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:36:35] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:38:26] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079035
[23:38:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079035 (owner: TrainBranchBot)
[23:41:28] <logmsgbot>	 !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus1005.eqiad.wmnet
[23:41:29] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:41:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:43:56] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
[23:44:31] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:44:40] <wikibugs>	 SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10215980 (Dzahn) For the first time puppet runs just fine on the new hardware now, before it is in production.  Also gerrit is deployed there already.  Everything is in plac...
[23:46:09] <wikibugs>	 SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10215987 (Dzahn) Notably this also means **gerrit on bookworm ** seems to work.  Since no more puppet issues, app deployed, same Java version.
[23:49:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2003.wikimedia.org
[23:51:17] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
[23:52:14] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[23:52:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:53:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status