[00:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:11:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:16:45] FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:18:52] (CR) RLazarus: [C:+1] charts: add ingress support to function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/1276873 (https://phabricator.wikimedia.org/T424193) (owner: Elukey) [00:18:57] (CR) RLazarus: [C:+1] services: enable ingress for evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/1276872 (https://phabricator.wikimedia.org/T424193) (owner: Elukey) [00:26:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:39:38] FIRING: [17x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:41:45] FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:44:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:46:45] RESOLVED: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:59:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:04:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:08:59] (Abandoned) RLazarus: cache.mcrouter: Add replica.remote_read option [deployment-charts] - https://gerrit.wikimedia.org/r/1259222 (https://phabricator.wikimedia.org/T411807) (owner: RLazarus) [01:09:03] (Abandoned) RLazarus: cache.mcrouter: Copy 1.3.4 to 1.3.5 [deployment-charts] - https://gerrit.wikimedia.org/r/1259221 (owner: RLazarus) [01:10:10] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 
https://gerrit.wikimedia.org/r/1277221 [01:10:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1277221 (owner: TrainBranchBot) [01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:46] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1277221 (owner: TrainBranchBot) [01:35:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T410589)', diff saved to https://phabricator.wikimedia.org/P91520 and previous config saved to /var/cache/conftool/dbconfig/20260425-013520-ladsgroup.json [01:35:25] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:39:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:44:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:45:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P91521 and previous config saved to /var/cache/conftool/dbconfig/20260425-014528-ladsgroup.json [01:49:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:55:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2195', diff saved to https://phabricator.wikimedia.org/P91522 and previous config saved to /var/cache/conftool/dbconfig/20260425-015535-ladsgroup.json [02:00:59] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:05:41] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:05:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T410589)', diff saved to https://phabricator.wikimedia.org/P91523 and previous config saved to /var/cache/conftool/dbconfig/20260425-020544-ladsgroup.json [02:05:49] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:05:50] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [02:07:11] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 11s) [02:09:18] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:41] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:34:18] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:54:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:14:38] FIRING: [17x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:19:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:44:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:59:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:14:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:19:38] FIRING: [17x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:26:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:30:12] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:30:41] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:47:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:49:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:52:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 
10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:04:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:12] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:20:41] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:22:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) 
{#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:27:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:34:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:39:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:24:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:29:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:59:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired 
[08:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:14:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:26:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:39:38] FIRING: [18x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:44:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:54:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:59:38] FIRING: [11x] 
CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:34:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs2007:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:07:53] ops-eqiad, SRE, DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1359:9290 - https://phabricator.wikimedia.org/T424396#11857963 (Jclark-ctr) Open→Resolved a:Jclark-ctr [11:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:04:38] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1013:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:04:48] FIRING: 
[66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:38] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1013:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:14:38] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:24:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:26:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:29:21] PROBLEM - Host db1244 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:30:16] here [12:30:32] !incidents [12:30:32] 7863 (UNACKED) Host db1244 (paged) [12:30:38] !ack 7863 [12:30:38] 7863 (ACKED) Host db1244 (paged) [12:31:47] candidate master for s4 [12:33:12] Amir1: create a task please and we'll handle it on Monday [12:33:23] yup on it [12:33:32] Thanks! 
[12:33:50] It may come back up in a bit if it rebooted itself [12:34:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:34:54] it actually went down exactly 1d ago. It seems it was part of a reboot that didn't come back [12:35:07] https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance I see s4 for maint yesterday [12:35:15] here [12:35:34] heh, too late :D [12:36:00] Amir1: right, if you don't mind creating a task for it so federico3 can double check it as that's part of his reboots [12:36:10] ops-eqiad, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423 (Ladsgroup) NEW [12:36:52] okay, I'm going to remove notification and resolve the page since it'll page us tomorrow [12:37:21] ops-eqiad, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858042 (Marostegui) p:Triage→High a:FCeratto-WMF This is candidate master for s4 so please treat it with high priority [12:37:30] Amir1: thanks! 
[12:38:32] (PS1) Ladsgroup: db1244: disable notification [puppet] - https://gerrit.wikimedia.org/r/1277229 (https://phabricator.wikimedia.org/T424423) [12:39:38] FIRING: [23x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:39:58] (CR) Ladsgroup: [C:+2] db1244: disable notification [puppet] - https://gerrit.wikimedia.org/r/1277229 (https://phabricator.wikimedia.org/T424423) (owner: Ladsgroup) [12:41:30] manually resolved it so it doesn't page in 24h [12:49:38] FIRING: [24x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:42] indeed the host was rebooted by the rolling restart script in the afternoon and then the silence timed out now. I'll chase the failed reboot on Monday [13:40:25] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858136 (Jclark-ctr) I will take a look remotely shortly. I am unable to go in right now. Is this urgent, requiring someone to go on site this weekend? 
[13:41:40] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858139 (Marostegui) Absolutely no need to even check it remotely today!! Thank you so much John for the response. This can wait until Monday! [14:22:40] aaa/go _ale [14:33:45] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858169 (Jclark-ctr) ` Date and Time The System Configuration Check operation resulted in multiple Riser issues. Tue Apr 14 2026 22:59:35 The System Configuration Check operation resulted in... [14:39:38] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858174 (Jclark-ctr) Updating firmware on iDRAC, BIOS, expander backplane [14:39:38] FIRING: [24x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:40:01] RECOVERY - Host db1244 #page is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:40:17] PROBLEM - mysqld processes #page on db1244 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:40:21] PROBLEM - MariaDB Replica IO: s4 #page on db1244 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:40:21] PROBLEM - MariaDB Replica Lag: s4 #page on db1244 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:40:22] PROBLEM - MariaDB Replica SQL: s4 #page on db1244 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:40:58] PROBLEM - MariaDB Events s4 on db1244 is CRITICAL: CRITICAL - Failed to query 
events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [14:40:58] PROBLEM - MariaDB Event Scheduler s4 on db1244 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [14:40:58] PROBLEM - MariaDB read only s4 on db1244 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:41:31] here [14:42:03] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858175 (Jclark-ctr) Server is back up right now but will be rebooting a few times during updates. [14:42:11] I assume that's just follow-up from the previous? [14:42:14] !ack [14:42:15] 7864 (ACKED) db1244 (paged)/mysqld processes (paged) [14:42:15] 7865 (ACKED) db1244 (paged)/MariaDB Replica Lag: s4 (paged) [14:42:15] 7866 (ACKED) db1244 (paged)/MariaDB Replica IO: s4 (paged) [14:42:15] 7867 (ACKED) db1244 (paged)/MariaDB Replica SQL: s4 (paged) [14:42:44] (as in, no action needed, will self-resolve?) [14:42:46] here now [14:42:50] sigh [14:42:59] o/ [14:44:33] PROBLEM - Host db1244 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:44:50] ._. [14:46:31] Raine: it came back up and then down [14:46:59] I just saw jclark-ctr commenting on the task. He is checking it [14:47:06] thanks, yeah [14:47:09] Raine: can you just downtime it for 7 days? [14:47:17] sure, on it, thanks marostegui [14:47:23] And then we will handle it on Monday [14:47:24] Multiple configuration related issues on the device Riser are resolved. It is doing some updates right now. 
[14:47:32] so might reboot 1 more time [14:47:36] Thanks jclark-ctr [14:49:58] !log kamila@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1244.eqiad.wmnet with reason: flaky host [14:49:59] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858188 (Jclark-ctr) ` Multiple configuration related issues on the device Riser are resolved. Sat Apr 25 2026 14:43:09 A configuration related issue on the device Riser is resolved. Sat Apr 2... [14:50:04] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858189 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ce0eb665-5f73-4895-a5e4-4007580075ae) set by kamila@cumin1003 for 7 days, 0:00:00 on 1 host(s) and their services with... [14:51:38] !incidents [14:51:38] 7864 (ACKED) db1244 (paged)/mysqld processes (paged) [14:51:39] 7865 (ACKED) db1244 (paged)/MariaDB Replica Lag: s4 (paged) [14:51:39] 7866 (ACKED) db1244 (paged)/MariaDB Replica IO: s4 (paged) [14:51:39] 7867 (ACKED) db1244 (paged)/MariaDB Replica SQL: s4 (paged) [14:51:39] 7868 (ACKED) Host db1244 (paged) [14:51:39] 7863 (RESOLVED) Host db1244 (paged) [14:51:46] !resolve [14:51:46] 7864 (RESOLVED) db1244 (paged)/mysqld processes (paged) [14:51:47] 7865 (RESOLVED) db1244 (paged)/MariaDB Replica Lag: s4 (paged) [14:51:47] 7866 (RESOLVED) db1244 (paged)/MariaDB Replica IO: s4 (paged) [14:51:47] 7867 (RESOLVED) db1244 (paged)/MariaDB Replica SQL: s4 (paged) [14:51:47] 7868 (RESOLVED) Host db1244 (paged) [14:51:54] reboots should be finished. idrac is updating right now. [14:51:59] no reboot needed for idrac [14:52:29] ops-eqiad, SRE, DBA, DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858207 (Jclark-ctr) [14:56:20] I can try to start mariadb a bit later just to make it catch up [15:01:46] @raine @amir1 I am finished with updates. 
[15:04:38] FIRING: [23x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:18:59] Thanks! [15:29:01] Amir1: I just started it [15:29:38] FIRING: [21x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:30:01] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1244 didn't come back online - https://phabricator.wikimedia.org/T424423#11858231 (10Marostegui) 05Open→03Resolved a:05FCeratto-WMF→03Jclark-ctr MariaDB started on db1244 - on Monday we can reenable notifications and pool it back. Thanks John for all the h... [15:30:10] (03PS1) 10Marostegui: Revert "db1244: disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/1277235 [15:30:20] (03CR) 10Marostegui: [C:04-2] "Needs to wait till Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/1277235 (owner: 10Marostegui) [15:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:34:38] FIRING: [19x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:49:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:38] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:26:40] FIRING: SystemdUnitFailed: 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:56:46] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:07:00] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.46 ms [17:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:29:38] FIRING: [14x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:39:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:49:38] FIRING: [18x] CertAlmostExpired: 
Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:54:38] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:09:38] FIRING: [22x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:34:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:39:38] FIRING: [15x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:04:48] FIRING: [66x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:08:24] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:38] FIRING: [10x] 
CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:26:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:52] FIRING: [32x] CertAlmostExpired: Certificate for service people1005:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:38] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1011:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:09:38] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:39:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:59:38] FIRING: [16x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:04:38] FIRING: [14x] CertAlmostExpired: 
Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:39:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277247 [23:39:43] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277247 (owner: 10TrainBranchBot) [23:51:08] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1277247 (owner: 10TrainBranchBot)