[00:07:32] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:08:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1198608 [00:08:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1198608 (owner: 10TrainBranchBot) [00:08:30] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 7.583 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:32:29] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1198608 (owner: 10TrainBranchBot) [00:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:32] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:00:44] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:22] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:13:54] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 09s) [01:40:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:00] PROBLEM - SSH on bast7002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:44:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:58] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:45:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:53:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [01:58:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [02:08:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [02:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:42:32] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [02:43:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.924 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [02:52:32] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [02:58:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 7.982 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:20:27] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:39:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:44:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30040 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [04:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:05:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:27] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:45:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:19:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:24:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:15:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:20:27] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:57:13] (03PS1) 10Elukey: role::maps::master_bookworm: enable tile invalidation in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1198614 (https://phabricator.wikimedia.org/T381565) [07:57:29] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198614 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:00:57] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11308924 (10elukey) Bootstrap completed, and it looks good: ` root@thanos-fe1004:~# swift stat tegola-swift-codfw-v003 | grep Objects Objects: 95190783 root@tha... [08:01:25] (03PS2) 10Elukey: role::maps::master_bookworm: enable tile invalidation in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1198614 (https://phabricator.wikimedia.org/T381565) [08:01:39] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198614 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:04:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:58] (03CR) 10Elukey: [C:03+2] role::maps::master_bookworm: enable tile invalidation in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1198614 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:05:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:34:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:35:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:44:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:20:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:53:38] (03CR) 10Jgiannelos: [C:04-1] "Blocked by https://phabricator.wikimedia.org/T408222" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) (owner: 10Jgiannelos) [09:54:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:55:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:23:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11308977 (10Jclark-ctr) @BTullis Sorry, I typed out a response earlier but forgot to post it. Unfortunately, we do not have any 4TB SAS... [10:44:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:27] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:22:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [11:24:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:40:34] PROBLEM - Host ms-be1090 is DOWN: PING CRITICAL - Packet loss = 100% [11:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [12:10:46] (03PS1) 10Bunnypranav: core-Namespaces: Add R: and R_talk: NS for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) [12:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:29:34] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [12:30:24] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30029 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:40:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired