[00:02:15] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:03:57] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:04:09] PROBLEM - SSH on puppetserver2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:04:15] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:04:27] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:08:14] FIRING: JobUnavailable: Reduced availability for job jmx_puppetserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:11:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076046 (owner: TrainBranchBot) [00:13:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:16:47] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:18:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:22:55] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:23:17] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:23:25] FIRING: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:56] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:28:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:32:55] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is 
OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:32:59] RECOVERY - SSH on puppetserver2001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:33:17] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:33:25] RESOLVED: [3x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:36:47] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:41:57] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:43:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:43:55] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:43:59] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:17] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:53:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:53:11] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:53:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:53:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:53:55] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:54:17] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:54:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:55:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:55:48] RESOLVED: PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:57:47] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:58:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:58:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:58:59] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:03:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:04] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:55] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:05:17] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:05:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:05:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:06:01] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:07:47] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:08:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:08:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[01:13:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:13:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:17] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:18:47] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:18:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:18:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:18:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:19:56] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:22:27] (CR) Ssingh: varnish: Give 1% of views RSA cert warnings (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall) [01:23:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:23:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:26:17] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:28:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:28:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[01:33:59] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:49] FIRING: PuppetFailure: Puppet has failed on mw1478:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:37:49] FIRING: PuppetFailure: Puppet has failed on config-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:38:55] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:38:59] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:39:19] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:39:48] FIRING: PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:39:49] FIRING: PuppetFailure: Puppet has failed on ml-serve1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:39:53] FIRING: PuppetFailure: Puppet has failed on restbase1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:40:19] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:41:48] FIRING: PuppetFailure: Puppet has failed on aux-k8s-etcd1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:41:49] FIRING: PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:42:49] FIRING: [12x] PuppetFailure: Puppet has failed on kubernetes1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:43:49] FIRING: PuppetFailure: Puppet has failed on rdb1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:43:53] FIRING: PuppetFailure: Puppet has failed on ganeti1026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:44:06] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:15] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:49] FIRING: [2x] PuppetFailure: Puppet has failed on ml-serve1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:44:49] FIRING: [2x] PuppetFailure: Puppet has failed on restbase1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:44:49] FIRING: PuppetFailure: Puppet has failed on pki-root1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:44:57] FIRING: PuppetFailure: Puppet has failed on ganeti1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:45:09] FIRING: PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:45:13] FIRING: PuppetFailure: Puppet has failed on kubestage1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:46:49] FIRING: [5x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:47:49] FIRING: [2x] PuppetFailure: Puppet has failed on config-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:48:05] FIRING: [24x] PuppetFailure: Puppet has failed on kubernetes1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:48:53] FIRING: [2x] PuppetFailure: Puppet has failed on ganeti1026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:48:57] FIRING: PuppetFailure: Puppet has failed on ml-etcd1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:49:06] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:15] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:49] FIRING: [3x] PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:49:49] FIRING: [4x] PuppetFailure: Puppet has failed on restbase1032:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:50:49] FIRING: PuppetFailure: Puppet has failed on wdqs1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:51:48] FIRING: [6x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:52:48] FIRING: [3x] PuppetFailure: Puppet has failed on config-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:53:05] FIRING: [36x] PuppetFailure: Puppet has failed on kubernetes1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:53:19] FIRING: PuppetZeroResources: Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:53:53] FIRING: PuppetFailure: Puppet has failed on wcqs1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:53:55] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:54:01] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:11] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:29] RESOLVED: JobUnavailable: Reduced availability for job jmx_puppetserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:54:48] FIRING: [5x] PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:54:49] FIRING: [5x] PuppetFailure: Puppet has failed on restbase1032:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:54:53] FIRING: [3x] PuppetFailure: Puppet has failed on ganeti1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:54:57] RESOLVED: PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:55:53] RESOLVED: PuppetFailure: Puppet has failed on wdqs1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:55:57] FIRING: PuppetFailure: Puppet has failed on mc-gp1003:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:56:09] FIRING: PuppetFailure: Puppet has failed on wdqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:56:49] FIRING: [10x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:57:53] FIRING: [52x] PuppetFailure: Puppet has failed on kubernetes1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:58:53] FIRING: [4x] PuppetFailure: Puppet has failed on ganeti1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:59:10] FIRING: [2x] PuppetFailure: Puppet has failed on kafka-main1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:59:14] RESOLVED: PuppetFailure: Puppet has failed on wcqs1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:59:23] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:59:32] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on 
puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:59:41] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:59:56] FIRING: [4x] PuppetFailure: Puppet has failed on ml-serve-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:00:01] FIRING: [6x] PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:00:09] FIRING: [6x] PuppetFailure: Puppet has failed on restbase1032:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:00:30] FIRING: PuppetFailure: Puppet has failed on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:00:38] FIRING: PuppetFailure: Puppet has failed on cloudelastic1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:00:47] FIRING: [4x] PuppetFailure: Puppet has failed on wdqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:01:06] RESOLVED: 
PuppetFailure: Puppet has failed on mc-gp1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:01:53] FIRING: [13x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:02:49] FIRING: [5x] PuppetFailure: Puppet has failed on config-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:03:05] FIRING: [61x] PuppetFailure: Puppet has failed on kubernetes1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:03:53] FIRING: [5x] PuppetFailure: Puppet has failed on ganeti1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:04:01] RESOLVED: [2x] PuppetFailure: Puppet has failed on kafka-main1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:04:10] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:04:12] (PS1) BPirkle: Make specs module available on beta and testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) [02:04:19] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on 
puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:04:49] FIRING: [7x] PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:04:49] FIRING: [9x] PuppetFailure: Puppet has failed on restbase1030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:05:01] FIRING: [5x] PuppetFailure: Puppet has failed on ganeti1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:05:18] RESOLVED: PuppetFailure: Puppet has failed on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:05:27] FIRING: [5x] PuppetFailure: Puppet has failed on wdqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:06:13] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:06:49] FIRING: [15x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:07:48] FIRING: [6x] PuppetFailure: Puppet has failed on config-master1001:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:08:05] FIRING: [62x] PuppetFailure: Puppet has failed on kubernetes1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:08:14] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:08:49] FIRING: [6x] PuppetFailure: Puppet has failed on ganeti1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:08:49] RESOLVED: PuppetFailure: Puppet has failed on rdb1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:09:01] FIRING: [2x] PuppetFailure: Puppet has failed on ml-etcd1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:09:06] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:15] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:48] 
FIRING: [3x] PuppetFailure: Puppet has failed on ml-serve-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:09:49] FIRING: [6x] PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:09:53] FIRING: [9x] PuppetFailure: Puppet has failed on restbase1030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:10:14] RESOLVED: PuppetFailure: Puppet has failed on pki-root1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:11:49] RESOLVED: PuppetFailure: Puppet has failed on aux-k8s-etcd1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:11:57] FIRING: [15x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:12:49] FIRING: [6x] PuppetFailure: Puppet has failed on config-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:13:01] FIRING: [63x] PuppetFailure: Puppet has failed on kubernetes1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:13:10] RESOLVED: [2x] 
PuppetZeroResources: Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:13:53] FIRING: [6x] PuppetFailure: Puppet has failed on ganeti1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:14:04] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:14:13] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:14:49] FIRING: [6x] PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:14:49] FIRING: [9x] PuppetFailure: Puppet has failed on restbase1030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:15:02] RESOLVED: PuppetFailure: Puppet has failed on kubestage1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:15:06] FIRING: [5x] PuppetFailure: Puppet has failed on ganeti1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:16:13] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:16:49] FIRING: [15x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:17:53] FIRING: [63x] PuppetFailure: Puppet has failed on kubernetes1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:18:49] RESOLVED: [6x] PuppetFailure: Puppet has failed on ganeti1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:19:01] RESOLVED: [2x] PuppetFailure: Puppet has failed on ml-etcd1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:19:10] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:19] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:28] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:49] RESOLVED: [4x] PuppetFailure: Puppet has failed on ml-serve-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:19:49] FIRING: [9x] PuppetFailure: Puppet has failed on restbase1030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:19:57] FIRING: [5x] PuppetFailure: Puppet has failed on ganeti1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:20:10] RESOLVED: PuppetFailure: Puppet has failed on cloudelastic1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:20:14] FIRING: [4x] PuppetFailure: Puppet has failed on wdqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:21:43] (CR) BPirkle: "Neglected to activate the specs module in the Thursday 9/26 deployment." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: BPirkle) [02:21:49] FIRING: [11x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:22:57] FIRING: [51x] PuppetFailure: Puppet has failed on kubernetes1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:23:08] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:23:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [02:23:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:23:59] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:48] RESOLVED: [4x] PuppetFailure: Puppet has failed on mc-wf1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:24:49] 
RESOLVED: [7x] PuppetFailure: Puppet has failed on restbase1030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:25:02] RESOLVED: [5x] PuppetFailure: Puppet has failed on ganeti1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:25:14] RESOLVED: [4x] PuppetFailure: Puppet has failed on wdqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:26:49] RESOLVED: [7x] PuppetFailure: Puppet has failed on elastic1055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:27:13] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:27:49] RESOLVED: [2x] PuppetFailure: Puppet has failed on config-master1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:27:49] RESOLVED: [34x] PuppetFailure: Puppet has failed on kubernetes1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:28:04] RESOLVED: [2x] PuppetZeroResources: Puppet has failed generate resources on elastic1098:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:28:55] FIRING: [6x] 
SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:28:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:33:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:33:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:34:55] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:37:13] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:38:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:59] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:10] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:42:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [02:43:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:43:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:45:55] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:47:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [02:48:13] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd 
unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:48:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:48:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:53:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:53:59] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:13] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:58:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:59] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:03:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:03:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:08:47] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:08:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:08:59] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:13] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:13:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:59] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on 
puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:16:17] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:18:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:18:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:13] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:19:47] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:22:35] SRE, Infrastructure-Foundations, Puppet-Infrastructure: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 - https://phabricator.wikimedia.org/T375839 (Vgutierrez) NEW [03:23:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:23:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on 
puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:26:31] !log powercycle puppetserver1001 - T375839 [03:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:37] T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 - https://phabricator.wikimedia.org/T375839 [03:27:17] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:28:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:29:35] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:30:13] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:32:51] (PS1) Vgutierrez: hiera: Add cluster definition to puppetserver role [puppet] - https://gerrit.wikimedia.org/r/1076074 [03:33:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:33:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:10] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:54] SRE, Infrastructure-Foundations, Puppet-Infrastructure: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 - https://phabricator.wikimedia.org/T375839#10181952 (Vgutierrez) p:Triage→High after a powercycle puppetserver1001 is responsive again [03:35:55] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:37:17] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:38:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:38:59] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:40:13] RECOVERY - Check unit status 
of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:40:48] FIRING: PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:41:57] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:48:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:48:55] FIRING: [6x] SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:49:47] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:19:56] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] 
Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240927T0600) [06:36:38] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1076074 (owner: Vgutierrez) [06:37:12] (CR) Vgutierrez: [C:+2] hiera: Add cluster definition to puppetserver role [puppet] - https://gerrit.wikimedia.org/r/1076074 (owner: Vgutierrez) [06:42:37] (PS1) Muehlenhoff: Send output of daily account check to SRE IF alias only [puppet] - https://gerrit.wikimedia.org/r/1076091 [06:42:49] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1240.eqiad.wmnet [06:42:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1240.eqiad.wmnet [06:42:56] (CR) CI reject: [V:-1] Send output of daily account check to SRE IF alias only [puppet] - https://gerrit.wikimedia.org/r/1076091 (owner: Muehlenhoff) [06:42:57] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1241.eqiad.wmnet [06:42:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1241.eqiad.wmnet [06:43:04] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1242.eqiad.wmnet [06:43:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1242.eqiad.wmnet [06:43:11] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1243.eqiad.wmnet [06:43:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1243.eqiad.wmnet [06:43:18] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1244.eqiad.wmnet [06:43:19] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1244.eqiad.wmnet [06:43:25] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1245.eqiad.wmnet [06:43:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1245.eqiad.wmnet [06:43:33] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1246.eqiad.wmnet [06:43:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1246.eqiad.wmnet [06:43:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1247.eqiad.wmnet [06:43:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1247.eqiad.wmnet [06:44:13] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1245.eqiad.wmnet [06:44:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1245.eqiad.wmnet [06:44:25] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1247.eqiad.wmnet [06:44:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1247.eqiad.wmnet [06:47:38] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1249.eqiad.wmnet [06:47:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1249.eqiad.wmnet [06:47:46] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1250.eqiad.wmnet [06:47:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1250.eqiad.wmnet [06:47:58] !log akosiaris@cumin1002 
START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1252.eqiad.wmnet [06:47:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1252.eqiad.wmnet [06:48:05] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1253.eqiad.wmnet [06:48:06] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1253.eqiad.wmnet [06:48:12] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1254.eqiad.wmnet [06:48:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1254.eqiad.wmnet [06:48:16] (03PS2) 10Muehlenhoff: Send output of daily account check to SRE IF alias only [puppet] - 10https://gerrit.wikimedia.org/r/1076091 [06:48:20] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1257.eqiad.wmnet [06:48:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1257.eqiad.wmnet [06:48:24] (03CR) 10Slyngshede: Send output of daily account check to SRE IF alias only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076091 (owner: 10Muehlenhoff) [06:48:27] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1258.eqiad.wmnet [06:48:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1258.eqiad.wmnet [06:48:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1259.eqiad.wmnet [06:48:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1259.eqiad.wmnet [06:48:42] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker1260.eqiad.wmnet [06:48:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1260.eqiad.wmnet [06:48:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1262.eqiad.wmnet [06:48:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1262.eqiad.wmnet [06:48:57] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1264.eqiad.wmnet [06:48:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1264.eqiad.wmnet [06:49:05] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1265.eqiad.wmnet [06:49:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1265.eqiad.wmnet [06:49:13] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1270.eqiad.wmnet [06:49:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1270.eqiad.wmnet [06:49:20] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1271.eqiad.wmnet [06:49:22] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1271.eqiad.wmnet [06:49:28] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1273.eqiad.wmnet [06:49:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1273.eqiad.wmnet [06:49:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1274.eqiad.wmnet [06:49:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) 
pool for host wikikube-worker1274.eqiad.wmnet [06:49:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1278.eqiad.wmnet [06:49:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1278.eqiad.wmnet [06:49:51] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1279.eqiad.wmnet [06:49:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1279.eqiad.wmnet [06:49:59] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1281.eqiad.wmnet [06:50:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1281.eqiad.wmnet [06:50:07] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1282.eqiad.wmnet [06:50:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1282.eqiad.wmnet [06:50:14] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1284.eqiad.wmnet [06:50:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1284.eqiad.wmnet [06:50:21] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1286.eqiad.wmnet [06:50:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1286.eqiad.wmnet [06:50:29] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1289.eqiad.wmnet [06:50:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1289.eqiad.wmnet [06:50:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool 
for host wikikube-worker1290.eqiad.wmnet [06:50:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1290.eqiad.wmnet [06:50:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1294.eqiad.wmnet [06:50:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1294.eqiad.wmnet [06:50:52] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1296.eqiad.wmnet [06:50:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1296.eqiad.wmnet [06:50:59] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1301.eqiad.wmnet [06:51:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1301.eqiad.wmnet [06:51:07] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1303.eqiad.wmnet [06:51:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1303.eqiad.wmnet [06:51:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1304.eqiad.wmnet [06:51:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1304.eqiad.wmnet [06:56:11] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 - https://phabricator.wikimedia.org/T375839#10182037 (10MoritzMuehlenhoff) →14Duplicate dup:03T373527 [06:56:41] 06SRE, 06Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10182039 (10MoritzMuehlenhoff) [07:00:05] Deploy window No deploys all day! 
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240927T0700) [07:12:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [07:12:08] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1352.eqiad.wmnet [07:12:28] !log depooling mw1352.eqiad.wmnet mw1353.eqiad.wmnet mw1354.eqiad.wmnet mw1355.eqiad.wmnet mw1356.eqiad.wmnet mw1357.eqiad.wmnet mw1360.eqiad.wmnet mw1361.eqiad.wmnet mw1362.eqiad.wmnet mw1363.eqiad.wmnet mw1367.eqiad.wmnet mw1368.eqiad.wmnet mw1369.eqiad.wmnet mw1370.eqiad.wmnet mw1371.eqiad.wmnet mw1374.eqiad.wmnet mw1375.eqiad.wmnet [07:12:28] mw1376.eqiad.wmnet mw1377.eqiad.wmnet mw1378.eqiad.wmnet mw1379.eqiad.wmnet mw1380.eqiad.wmnet mw1381.eqiad.wmnet mw1382.eqiad.wmnet mw1383.eqiad.wmnet mw1384.eqiad.wmnet mw1385.eqiad.wmnet mw1386.eqiad.wmnet mw1387.eqiad.wmnet mw1388.eqiad.wmnet mw1389.eqiad.wmnet mw1390.eqiad.wmnet mw1391.eqiad.wmnet mw1392.eqiad.wmnet mw1393.eqiad.wmnet [07:12:28] mw1394.eqiad.wmnet mw1395.eqiad.wmnet mw1396.eqiad.wmnet mw1397.eqiad.wmnet mw1399.eqiad.wmnet mw1405.eqiad.wmnet mw1408.eqiad.wmnet mw1409.eqiad.wmnet T375842 [07:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:32] T375842: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842 [07:12:42] ok let's format this better [07:12:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1352.eqiad.wmnet [07:12:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1353.eqiad.wmnet [07:13:03] !log T375842 depooling mw1352.eqiad.wmnet mw1353.eqiad.wmnet mw1354.eqiad.wmnet 
mw1355.eqiad.wmnet mw1356.eqiad.wmnet mw1357.eqiad.wmnet mw1360.eqiad.wmnet mw1361.eqiad.wmnet mw1362.eqiad.wmnet mw1363.eqiad.wmnet mw1367.eqiad.wmnet mw1368.eqiad.wmnet mw1369.eqiad.wmnet mw1370.eqiad.wmnet [07:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1353.eqiad.wmnet [07:13:32] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1354.eqiad.wmnet [07:13:35] !log T375842 depooling mw1371.eqiad.wmnet mw1374.eqiad.wmnet mw1375.eqiad.wmnet mw1376.eqiad.wmnet mw1377.eqiad.wmnet mw1378.eqiad.wmnet mw1379.eqiad.wmnet mw1380.eqiad.wmnet mw1381.eqiad.wmnet mw1382.eqiad.wmnet mw1383.eqiad.wmnet mw1384.eqiad.wmnet mw1385.eqiad.wmnet [07:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:55] !log T375842 depooling mw1386.eqiad.wmnet mw1387.eqiad.wmnet mw1388.eqiad.wmnet mw1389.eqiad.wmnet mw1390.eqiad.wmnet mw1391.eqiad.wmnet mw1392.eqiad.wmnet mw1393.eqiad.wmnet mw1394.eqiad.wmnet mw1395.eqiad.wmnet mw1396.eqiad.wmnet mw1397.eqiad.wmnet mw1399.eqiad.wmnet mw1405.eqiad.wmnet mw1408.eqiad.wmnet mw1409.eqiad.wmnet T [07:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1354.eqiad.wmnet [07:14:11] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1355.eqiad.wmnet [07:14:43] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1355.eqiad.wmnet [07:14:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1356.eqiad.wmnet [07:15:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1356.eqiad.wmnet [07:15:32] !log akosiaris@cumin1002 START - 
Cookbook sre.k8s.pool-depool-node depool for host mw1357.eqiad.wmnet [07:16:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1357.eqiad.wmnet [07:16:15] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1360.eqiad.wmnet [07:16:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1360.eqiad.wmnet [07:16:58] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1361.eqiad.wmnet [07:17:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [07:17:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1361.eqiad.wmnet [07:17:36] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1362.eqiad.wmnet [07:18:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1362.eqiad.wmnet [07:18:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1363.eqiad.wmnet [07:18:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1363.eqiad.wmnet [07:19:01] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1367.eqiad.wmnet [07:19:30] (03CR) 10JMeybohm: [C:04-1] "Great addition! Would you be so kind to add something to the CHANGELOG of the module?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1074099 (owner: 10Brouberol) [07:19:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1367.eqiad.wmnet [07:19:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1368.eqiad.wmnet [07:20:04] (03CR) 10JMeybohm: [C:03+1] Release new base.helper and base.meta modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074098 (owner: 10Brouberol) [07:20:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1368.eqiad.wmnet [07:20:18] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1369.eqiad.wmnet [07:20:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1369.eqiad.wmnet [07:21:01] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1370.eqiad.wmnet [07:21:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1370.eqiad.wmnet [07:21:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1371.eqiad.wmnet [07:22:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1371.eqiad.wmnet [07:22:21] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1374.eqiad.wmnet [07:22:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1374.eqiad.wmnet [07:23:00] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1375.eqiad.wmnet [07:26:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1375.eqiad.wmnet [07:26:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1376.eqiad.wmnet [07:26:49] !log 
akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1376.eqiad.wmnet [07:26:55] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1377.eqiad.wmnet [07:27:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1377.eqiad.wmnet [07:27:34] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1378.eqiad.wmnet [07:28:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1378.eqiad.wmnet [07:28:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1379.eqiad.wmnet [07:28:49] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1379.eqiad.wmnet [07:28:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:28:55] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1380.eqiad.wmnet [07:29:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1380.eqiad.wmnet [07:29:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1381.eqiad.wmnet [07:30:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1381.eqiad.wmnet [07:30:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1382.eqiad.wmnet [07:30:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1382.eqiad.wmnet [07:30:54] !log akosiaris@cumin1002 START - Cookbook 
sre.k8s.pool-depool-node depool for host mw1383.eqiad.wmnet [07:34:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1383.eqiad.wmnet [07:34:11] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1384.eqiad.wmnet [07:34:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1384.eqiad.wmnet [07:34:53] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1385.eqiad.wmnet [07:35:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1385.eqiad.wmnet [07:35:36] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1386.eqiad.wmnet [07:35:54] 10SRE-swift-storage, 06Wikimedia Enterprise: Commonswiki recently updated files not found - https://phabricator.wikimedia.org/T375797#10182078 (10MatthewVernon) Are you entirely sure these files were actually uploaded today? I ask because, picking the first of these and searching our swift server logs for toda... 
[07:36:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1386.eqiad.wmnet [07:36:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1387.eqiad.wmnet [07:36:49] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1387.eqiad.wmnet [07:36:55] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1388.eqiad.wmnet [07:37:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1388.eqiad.wmnet [07:37:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1389.eqiad.wmnet [07:38:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1389.eqiad.wmnet [07:38:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1390.eqiad.wmnet [07:38:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1390.eqiad.wmnet [07:38:58] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1391.eqiad.wmnet [07:39:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1391.eqiad.wmnet [07:39:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1392.eqiad.wmnet [07:39:57] FIRING: SystemdUnitFailed: systemd-sysusers.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:40:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1392.eqiad.wmnet [07:40:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1393.eqiad.wmnet 
[07:40:42] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 - https://phabricator.wikimedia.org/T375839#10182082 (10Jelto) [07:40:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1393.eqiad.wmnet [07:41:02] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1394.eqiad.wmnet [07:41:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1394.eqiad.wmnet [07:41:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1395.eqiad.wmnet [07:42:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1395.eqiad.wmnet [07:42:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1396.eqiad.wmnet [07:42:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1396.eqiad.wmnet [07:42:58] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1397.eqiad.wmnet [07:43:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1397.eqiad.wmnet [07:43:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1399.eqiad.wmnet [07:44:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1399.eqiad.wmnet [07:44:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1405.eqiad.wmnet [07:44:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1405.eqiad.wmnet [07:44:56] FIRING: [2x] SystemdUnitFailed: systemd-sysusers.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:45:02] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1408.eqiad.wmnet [07:45:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1408.eqiad.wmnet [07:45:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw1409.eqiad.wmnet [07:46:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw1409.eqiad.wmnet [07:48:55] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:50:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:53:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:59:14] 06SRE, 06Infrastructure-Foundations: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098#10182133 (10akosiaris) And we've just seen this on parsoidtest1001 which is bullseye. Old host, scandium is on buster.
[08:03:49] (03PS1) 10Muehlenhoff: On Buster/Bullseye don't use systemd-sysuser [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076144 [08:13:07] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:13:27] (03PS2) 10Muehlenhoff: On Buster/Bullseye don't use systemd-sysuser [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076144 [08:13:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [08:13:44] Deployment mw-api-int.eqiad.main in mw-api-int at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-int&var-deployment=mw-api-int.eqiad.main - ... [08:13:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:14:23] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] On Buster/Bullseye don't use systemd-sysuser [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076144 (owner: 10Muehlenhoff) [08:18:33] (03PS1) 10Muehlenhoff: Bump version in debian/changelog [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076145 [08:18:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-int.eqiad.main in mw-api-int at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:19:37] (03PS1) 10Alexandros Kosiaris: decom mw135[2-7]|mw136[0-37-9]|mw137[14-90]|mw138[0-9]|mw139[0-79]|mw140[589] [puppet] - 10https://gerrit.wikimedia.org/r/1076146 (https://phabricator.wikimedia.org/T375842) [08:21:07] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - 
https://phabricator.wikimedia.org/T375847 (10aborrero) 03NEW [08:22:38] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182170 (10aborrero) 05Open→03In progress p:05Triage→03Medium [08:23:40] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182174 (10aborrero) [08:23:44] FIRING: [7x] KubernetesDeploymentUnavailableReplicas: Deployment changeprop-production in changeprop-jobqueue at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:26:58] !log akosiaris@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[1352-1357,1360-1363,1367-1371,1399,1405,1408-1409].eqiad.wmnet [08:28:14] (03CR) 10Alexandros Kosiaris: [C:03+2] decom mw135[2-7]|mw136[0-37-9]|mw137[14-90]|mw138[0-9]|mw139[0-79]|mw140[589] [puppet] - 10https://gerrit.wikimedia.org/r/1076146 (https://phabricator.wikimedia.org/T375842) (owner: 10Alexandros Kosiaris) [08:28:55] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182196 (10aborrero) [08:33:44] FIRING: [3x] KubernetesDeploymentUnavailableReplicas: Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:34:32] * akosiaris fixing ^ [08:34:48] 06SRE, 06Infrastructure-Foundations, 10netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10182201 (10aborrero) >>! 
In T374712#10163608, @cmooney wrote: > @arturo /64s for VM usage I guess can be allocated from [[ https://netbox.wikimedia.org/ipam/prefixes/1079/pref... [08:35:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:35:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:37:26] * akosiaris running homer for that ^ too [08:51:34] (03PS1) 10Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) [08:52:33] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump version in debian/changelog [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076145 (owner: 10Muehlenhoff) [08:52:59] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) (owner: 10Brouberol) [08:53:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:53:48] FIRING: PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [09:05:48] (03PS2) 10Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) [09:06:33] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) (owner: 10Brouberol) [09:06:50] (03PS1) 10Muehlenhoff: Fix detection of current OS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076152 [09:07:42] (03PS3) 10Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) [09:08:27] (03CR) 10Alexandros Kosiaris: [C:04-1] Fix detection of current OS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076152 (owner: 10Muehlenhoff) [09:09:30] (03PS2) 10Muehlenhoff: Fix detection of current OS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076152 [09:09:46] (03CR) 10Muehlenhoff: Fix detection of current OS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076152 (owner: 10Muehlenhoff) [09:10:47] (03CR) 10Alexandros Kosiaris: [C:03+1] Fix detection of current OS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076152 (owner: 10Muehlenhoff) [09:11:14] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Fix detection of current OS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1076152 (owner: 10Muehlenhoff) [09:14:10] (03PS4) 10Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) [09:17:05] (03PS5) 10Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 
(https://phabricator.wikimedia.org/T375850) [09:17:44] (03PS6) 10Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) [09:18:48] RESOLVED: PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:31:32] (03PS1) 10MVernon: hiera: add apus to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1076158 (https://phabricator.wikimedia.org/T279621) [09:33:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:39:14] RESOLVED: [3x] KubernetesDeploymentUnavailableReplicas: Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [09:44:50] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: SLIMetricMissing - https://phabricator.wikimedia.org/T372454#10182347 (10LSobanski) `SLIMetricMissing` for both DCs have been firing for 1 month again. 
[09:46:02] (03PS1) 10Muehlenhoff: Deprecate system::role for initial batch of serviceops services [puppet] - 10https://gerrit.wikimedia.org/r/1076160 [09:46:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1076158 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [09:53:21] (03CR) 10MVernon: [C:03+2] hiera: add apus to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1076158 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [09:55:17] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10182363 (10phaultfinder) [10:01:05] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075935 (https://phabricator.wikimedia.org/T375715) (owner: 10Brouberol) [10:01:15] (03CR) 10Brouberol: [C:03+2] airflow: introduce a way to define Airflow variables as values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075935 (https://phabricator.wikimedia.org/T375715) (owner: 10Brouberol) [10:02:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:03:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:04:53] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182409 (10aborrero) new instance creation will allocate an IPv6 by default for a VM: {F57561880} [10:05:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:10:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin 
failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:11:41] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: sync [10:12:07] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: sync [10:14:51] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182457 (10aborrero) however, instance creation itself failed: ` 2024-09-27 10:07:20.088 451966 ERROR nova.compute.manager [N... [10:18:08] (03PS1) 10Muehlenhoff: wdqs: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1076167 [10:20:46] (03PS1) 10Brouberol: airflow: fix, mount the variables.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076168 (https://phabricator.wikimedia.org/T375770) [10:21:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1076167 (owner: 10Muehlenhoff) [10:22:46] (03CR) 10Brouberol: [C:03+2] airflow: fix, mount the variables.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076168 (https://phabricator.wikimedia.org/T375770) (owner: 10Brouberol) [10:23:33] (03PS7) 10Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) [10:44:56] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:47:31] (03CR) 10Daniel Kinzler: Make specs module available on beta and testwiki 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [10:54:24] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182562 (10aborrero) neutron virtual router has the right IPv6 address: {F57561976} [10:58:19] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240927T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240927T1100). [11:06:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [11:06:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:07:15] ^ expected, should resolve soon [11:10:33] (03PS12) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [11:11:11] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182616 (10aborrero) The VM did not get the IPv6 assigned in the interface via dhcpv6 :-( `lang=shell-session aborrero@ipv6:~$ ip -br a lo...
[11:11:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:12:02] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182629 (10aborrero) We got DNS integration half working: `lang=shell-session aborrero@ipv6:~$ host ipv6.cloudinfra-codfw1dev.codfw1dev.wikimedia.cl... [11:13:55] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:43] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [11:20:38] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1352-1357,1360-1363,1367-1371,1399,1405,1408-1409].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [11:21:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1352-1357,1360-1363,1367-1371,1399,1405,1408-1409].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [11:21:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:21:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1352-1357,1360-1363,1367-1371,1399,1405,1408-1409].eqiad.wmnet [11:22:01] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10182667 (10aborrero) I see the dhcp6 packets from my test VM arriving into neutron: ` 11:20:29.156995 IP6 fe80::f816:3eff:fe3e:4b38.546 > ff02::1:2.... [11:23:52] !log akosiaris@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[1374-1393].eqiad.wmnet [11:30:48] (03PS5) 10Alexandros Kosiaris: Switch scandium references to parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) [11:30:48] (03PS3) 10Alexandros Kosiaris: Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) [11:31:15] (03CR) 10CI reject: [V:04-1] Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) (owner: 10Alexandros Kosiaris) [11:40:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [11:47:27] (03PS1) 10Muehlenhoff: Test puppet-managed /var/lib/ganeti/known_hosts on ganeti-test2003 [puppet] - 10https://gerrit.wikimedia.org/r/1076188 (https://phabricator.wikimedia.org/T309724) [11:51:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1076188 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [11:53:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [11:58:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [11:58:30] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10182738 (10MoritzMuehlenhoff) [12:03:25] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10182766 (10Jelto) I added a summary of the rate limiting and abuse tooling (including nfta... [12:03:55] (03PS1) 10D3r1ck01: [beta-cluster] Enable cookie-based SUL3 feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076195 (https://phabricator.wikimedia.org/T375787) [12:05:15] (03PS2) 10D3r1ck01: [beta-cluster] Enable cookie-based SUL3 feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076195 (https://phabricator.wikimedia.org/T375787) [12:05:38] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1240.eqiad.wmnet [12:05:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1240.eqiad.wmnet [12:05:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1241.eqiad.wmnet [12:05:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1241.eqiad.wmnet [12:05:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1242.eqiad.wmnet [12:05:50] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1242.eqiad.wmnet [12:05:56] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1243.eqiad.wmnet [12:05:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1243.eqiad.wmnet [12:06:02] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1244.eqiad.wmnet [12:06:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1244.eqiad.wmnet [12:06:08] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1245.eqiad.wmnet [12:06:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1245.eqiad.wmnet [12:06:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1246.eqiad.wmnet [12:06:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1246.eqiad.wmnet [12:06:23] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1247.eqiad.wmnet [12:06:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1247.eqiad.wmnet [12:06:29] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1248.eqiad.wmnet [12:06:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1248.eqiad.wmnet [12:06:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1249.eqiad.wmnet [12:06:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1249.eqiad.wmnet [12:06:41] !log 
akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1250.eqiad.wmnet [12:06:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1250.eqiad.wmnet [12:06:47] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1251.eqiad.wmnet [12:06:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1251.eqiad.wmnet [12:06:53] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1252.eqiad.wmnet [12:06:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1252.eqiad.wmnet [12:06:59] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1253.eqiad.wmnet [12:06:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1253.eqiad.wmnet [12:07:06] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1254.eqiad.wmnet [12:07:06] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1254.eqiad.wmnet [12:07:12] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1255.eqiad.wmnet [12:07:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1255.eqiad.wmnet [12:07:18] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1256.eqiad.wmnet [12:07:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1256.eqiad.wmnet [12:07:24] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1257.eqiad.wmnet 
[12:07:24] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1257.eqiad.wmnet [12:07:30] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1258.eqiad.wmnet [12:07:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1258.eqiad.wmnet [12:07:36] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1259.eqiad.wmnet [12:07:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1259.eqiad.wmnet [12:07:42] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1260.eqiad.wmnet [12:07:42] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1260.eqiad.wmnet [12:07:48] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1261.eqiad.wmnet [12:07:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1261.eqiad.wmnet [12:07:54] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1262.eqiad.wmnet [12:07:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1262.eqiad.wmnet [12:08:00] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1263.eqiad.wmnet [12:08:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1263.eqiad.wmnet [12:08:06] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1264.eqiad.wmnet [12:08:06] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host 
wikikube-worker1264.eqiad.wmnet [12:08:13] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1265.eqiad.wmnet [12:08:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1265.eqiad.wmnet [12:08:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1266.eqiad.wmnet [12:08:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1266.eqiad.wmnet [12:08:25] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1267.eqiad.wmnet [12:08:25] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1267.eqiad.wmnet [12:08:31] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1268.eqiad.wmnet [12:08:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1268.eqiad.wmnet [12:08:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1269.eqiad.wmnet [12:08:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1269.eqiad.wmnet [12:08:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1270.eqiad.wmnet [12:08:43] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1270.eqiad.wmnet [12:08:49] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1271.eqiad.wmnet [12:08:50] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1271.eqiad.wmnet [12:08:55] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check 
for host wikikube-worker1272.eqiad.wmnet [12:08:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1272.eqiad.wmnet [12:09:01] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1273.eqiad.wmnet [12:09:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1273.eqiad.wmnet [12:09:08] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1274.eqiad.wmnet [12:09:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1274.eqiad.wmnet [12:09:14] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1275.eqiad.wmnet [12:09:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1275.eqiad.wmnet [12:09:20] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1276.eqiad.wmnet [12:09:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1276.eqiad.wmnet [12:09:26] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1277.eqiad.wmnet [12:09:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1277.eqiad.wmnet [12:09:32] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1278.eqiad.wmnet [12:09:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1278.eqiad.wmnet [12:09:38] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1279.eqiad.wmnet [12:09:38] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1279.eqiad.wmnet [12:09:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1280.eqiad.wmnet [12:09:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1280.eqiad.wmnet [12:09:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1281.eqiad.wmnet [12:09:50] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1281.eqiad.wmnet [12:09:56] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1282.eqiad.wmnet [12:09:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1282.eqiad.wmnet [12:10:02] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1283.eqiad.wmnet [12:10:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1283.eqiad.wmnet [12:10:08] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1284.eqiad.wmnet [12:10:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1284.eqiad.wmnet [12:10:15] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1285.eqiad.wmnet [12:10:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1285.eqiad.wmnet [12:10:21] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1286.eqiad.wmnet [12:10:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1286.eqiad.wmnet [12:10:27] !log 
akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1287.eqiad.wmnet [12:10:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1287.eqiad.wmnet [12:10:34] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1288.eqiad.wmnet [12:10:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1288.eqiad.wmnet [12:10:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1289.eqiad.wmnet [12:10:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1289.eqiad.wmnet [12:10:46] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1290.eqiad.wmnet [12:10:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1290.eqiad.wmnet [12:10:52] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1291.eqiad.wmnet [12:10:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1291.eqiad.wmnet [12:10:58] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1292.eqiad.wmnet [12:10:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1292.eqiad.wmnet [12:11:03] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1293.eqiad.wmnet [12:11:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1293.eqiad.wmnet [12:11:10] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1294.eqiad.wmnet 
[12:11:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1294.eqiad.wmnet [12:11:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1295.eqiad.wmnet [12:11:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1295.eqiad.wmnet [12:11:22] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1296.eqiad.wmnet [12:11:22] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1296.eqiad.wmnet [12:11:28] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1297.eqiad.wmnet [12:11:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1297.eqiad.wmnet [12:11:34] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1298.eqiad.wmnet [12:11:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1298.eqiad.wmnet [12:11:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1299.eqiad.wmnet [12:11:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1299.eqiad.wmnet [12:11:47] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1300.eqiad.wmnet [12:11:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1300.eqiad.wmnet [12:11:53] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1301.eqiad.wmnet [12:11:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host 
wikikube-worker1301.eqiad.wmnet [12:11:59] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1302.eqiad.wmnet [12:11:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1302.eqiad.wmnet [12:12:05] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1303.eqiad.wmnet [12:12:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1303.eqiad.wmnet [12:12:12] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1304.eqiad.wmnet [12:12:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1304.eqiad.wmnet [12:13:05] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1240.eqiad.wmnet [12:13:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1240.eqiad.wmnet [12:13:15] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1241.eqiad.wmnet [12:13:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1241.eqiad.wmnet [12:13:24] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1242.eqiad.wmnet [12:13:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1242.eqiad.wmnet [12:13:33] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1243.eqiad.wmnet [12:13:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1243.eqiad.wmnet [12:13:36] !log T369744 pool wikikube-worker1280-1304. 
They are now fully in production [12:13:42] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1244.eqiad.wmnet [12:13:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1244.eqiad.wmnet [12:13:51] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1245.eqiad.wmnet [12:13:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1245.eqiad.wmnet [12:14:00] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1246.eqiad.wmnet [12:14:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1246.eqiad.wmnet [12:14:08] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1247.eqiad.wmnet [12:14:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1247.eqiad.wmnet [12:14:17] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1248.eqiad.wmnet [12:14:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1248.eqiad.wmnet [12:14:26] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1249.eqiad.wmnet [12:14:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1249.eqiad.wmnet [12:14:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1250.eqiad.wmnet [12:14:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1250.eqiad.wmnet [12:14:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker1251.eqiad.wmnet [12:14:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1251.eqiad.wmnet [12:14:52] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1252.eqiad.wmnet [12:14:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1252.eqiad.wmnet [12:15:00] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1253.eqiad.wmnet [12:15:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1253.eqiad.wmnet [12:15:09] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1254.eqiad.wmnet [12:15:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1254.eqiad.wmnet [12:15:17] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1255.eqiad.wmnet [12:15:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1255.eqiad.wmnet [12:15:26] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1256.eqiad.wmnet [12:15:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1256.eqiad.wmnet [12:15:34] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1257.eqiad.wmnet [12:15:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1257.eqiad.wmnet [12:15:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1258.eqiad.wmnet [12:15:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) 
pool for host wikikube-worker1258.eqiad.wmnet [12:15:52] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1259.eqiad.wmnet [12:15:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1259.eqiad.wmnet [12:16:01] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1260.eqiad.wmnet [12:16:03] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1260.eqiad.wmnet [12:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:07] T369744: wikikube-worker1240 to wikikube-worker1304 implementation tracking - https://phabricator.wikimedia.org/T369744 [12:16:09] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1261.eqiad.wmnet [12:16:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1261.eqiad.wmnet [12:16:18] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1262.eqiad.wmnet [12:16:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1262.eqiad.wmnet [12:16:27] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1263.eqiad.wmnet [12:16:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1263.eqiad.wmnet [12:16:36] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1264.eqiad.wmnet [12:16:36] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:16:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1264.eqiad.wmnet [12:16:45] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1265.eqiad.wmnet [12:16:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1265.eqiad.wmnet [12:16:54] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1266.eqiad.wmnet [12:16:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1266.eqiad.wmnet [12:17:02] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1267.eqiad.wmnet [12:17:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1267.eqiad.wmnet [12:17:11] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1268.eqiad.wmnet [12:17:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1268.eqiad.wmnet [12:17:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1269.eqiad.wmnet [12:17:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1269.eqiad.wmnet [12:17:28] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1270.eqiad.wmnet [12:17:30] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1270.eqiad.wmnet [12:17:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1271.eqiad.wmnet [12:17:39] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1271.eqiad.wmnet [12:17:46] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1272.eqiad.wmnet [12:17:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1272.eqiad.wmnet [12:17:55] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1273.eqiad.wmnet [12:17:57] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1273.eqiad.wmnet [12:18:04] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1274.eqiad.wmnet [12:18:06] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1274.eqiad.wmnet [12:18:13] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1275.eqiad.wmnet [12:18:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1275.eqiad.wmnet [12:18:21] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1276.eqiad.wmnet [12:18:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1276.eqiad.wmnet [12:18:30] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1277.eqiad.wmnet [12:18:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1277.eqiad.wmnet [12:18:39] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1278.eqiad.wmnet [12:18:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1278.eqiad.wmnet [12:18:48] !log akosiaris@cumin1002 START - 
Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1279.eqiad.wmnet [12:18:50] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1279.eqiad.wmnet [12:18:56] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1280.eqiad.wmnet [12:18:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1280.eqiad.wmnet [12:19:05] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1281.eqiad.wmnet [12:19:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1281.eqiad.wmnet [12:19:14] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1282.eqiad.wmnet [12:19:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1282.eqiad.wmnet [12:19:23] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1283.eqiad.wmnet [12:19:25] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1283.eqiad.wmnet [12:19:32] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1284.eqiad.wmnet [12:19:34] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1284.eqiad.wmnet [12:19:40] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1285.eqiad.wmnet [12:19:43] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1285.eqiad.wmnet [12:19:49] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1286.eqiad.wmnet [12:19:51] !log akosiaris@cumin1002 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1286.eqiad.wmnet [12:19:58] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1287.eqiad.wmnet [12:20:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1287.eqiad.wmnet [12:20:07] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1288.eqiad.wmnet [12:20:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1288.eqiad.wmnet [12:20:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1289.eqiad.wmnet [12:20:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1289.eqiad.wmnet [12:20:25] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1290.eqiad.wmnet [12:20:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1290.eqiad.wmnet [12:20:34] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1291.eqiad.wmnet [12:20:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1291.eqiad.wmnet [12:20:42] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1292.eqiad.wmnet [12:20:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1292.eqiad.wmnet [12:20:51] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1293.eqiad.wmnet [12:20:53] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1293.eqiad.wmnet [12:20:59] !log akosiaris@cumin1002 
START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1294.eqiad.wmnet [12:21:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1294.eqiad.wmnet [12:21:08] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1295.eqiad.wmnet [12:21:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1295.eqiad.wmnet [12:21:17] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1296.eqiad.wmnet [12:21:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1296.eqiad.wmnet [12:21:26] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1297.eqiad.wmnet [12:21:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1297.eqiad.wmnet [12:21:35] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1298.eqiad.wmnet [12:21:37] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [12:21:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1298.eqiad.wmnet [12:21:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1299.eqiad.wmnet [12:21:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1299.eqiad.wmnet [12:21:52] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker1300.eqiad.wmnet [12:21:52] (PS1) Slyngshede: Password change form. [software/bitu] - https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) [12:21:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1300.eqiad.wmnet [12:22:02] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1301.eqiad.wmnet [12:22:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1301.eqiad.wmnet [12:22:10] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1302.eqiad.wmnet [12:22:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1302.eqiad.wmnet [12:22:19] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1303.eqiad.wmnet [12:22:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1303.eqiad.wmnet [12:22:27] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1304.eqiad.wmnet [12:22:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1304.eqiad.wmnet [12:23:13] (CR) CI reject: [V:-1] Password change form. [software/bitu] - https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) (owner: Slyngshede) [12:23:33] (PS8) Brouberol: cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) [12:27:58] (PS2) Slyngshede: Password change form. 
[software/bitu] - https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) [12:40:45] (PS24) Btullis: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [12:41:06] (CR) CI reject: [V:-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [12:41:37] (PS25) Btullis: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [12:41:57] (CR) CI reject: [V:-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [12:42:09] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:45:37] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1374-1393].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [12:47:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1374-1393].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [12:47:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:47:01] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw[1374-1393].eqiad.wmnet [12:48:33] !log akosiaris@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[1393-1397].eqiad.wmnet [12:53:17] !log jynus@cumin1002 dbctl commit (dc=all): 
'repool es1022 with 1% load', diff saved to https://phabricator.wikimedia.org/P69433 and previous config saved to /var/cache/conftool/dbconfig/20240927-125317-jynus.json [12:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:00:12] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:00:16] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:18] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:14] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:18] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:01:18] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:08:00] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:08:56] !log jynus@cumin1002 dbctl commit (dc=all): 'repool es1022 with 100% load', diff saved to https://phabricator.wikimedia.org/P69434 and previous config saved to /var/cache/conftool/dbconfig/20240927-130855-jynus.json [13:12:18] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10183186 (jcrespo) In progress→Resolved I double checked the RAID status and everything is looking good (overal status and all disk status). I repooled the server at 100% weight. Thanks to... 
[13:16:51] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: ensure that real s3 credentials are provided [deployment-charts] - https://gerrit.wikimedia.org/r/1076151 (https://phabricator.wikimedia.org/T375850) (owner: Brouberol) [13:17:26] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 599223592 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:19:26] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 129744 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:19:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) (owner: Santiago Faci) [13:19:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) (owner: Santiago Faci) [13:20:32] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10183194 (phaultfinder) [13:21:32] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1393-1397].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [13:22:32] (CR) Ssingh: sre.cdn.pdns-recursor: add rolling restart script (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: CDobbins) [13:22:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1393-1397].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [13:22:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:45] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw[1393-1397].eqiad.wmnet [13:25:06] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10183201 (phaultfinder) [13:25:29] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10183202 (phaultfinder) [13:30:03] ops-codfw, SRE, DC-Ops, Traffic: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10183208 (ssingh) Open→Resolved Host has been pooled and no issues since, thanks for the quick turnaround @Jha... [13:31:18] (PS3) Brouberol: Release new base.helper and base.meta modules [deployment-charts] - https://gerrit.wikimedia.org/r/1074098 [13:31:18] (PS5) Brouberol: Compute stable secret checksums by default [deployment-charts] - https://gerrit.wikimedia.org/r/1074099 [13:31:31] (CR) Brouberol: "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1074099 (owner: Brouberol) [13:32:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:32:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:35:22] !incidents [13:35:23] 5284 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [13:35:23] 5283 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [13:38:12] denisse: so far all good :) [13:38:45] sukhe: nice! 
🙌🏻 [13:41:50] SRE, serviceops: php7.2-fpm_check_restart should be resilient to php7adm error pages - https://phabricator.wikimedia.org/T285593#10183281 (akosiaris) Open→Invalid 3 years later, we no longer have appservers, this is probably best closed as invalid [13:45:00] (PS1) Brouberol: wikimedia.org: provision subdomains for our airflow instances [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) [13:46:10] (PS2) Brouberol: wikimedia.org: provision subdomains for our airflow instances [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) [13:46:58] (CR) Ssingh: wikimedia.org: provision subdomains for our airflow instances (1 comment) [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) (owner: Brouberol) [13:47:24] (CR) CI reject: [V:-1] wikimedia.org: provision subdomains for our airflow instances [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) (owner: Brouberol) [13:47:59] (PS3) Brouberol: wikimedia.org: provision subdomains for our airflow instances [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) [13:48:06] (CR) Brouberol: wikimedia.org: provision subdomains for our airflow instances (1 comment) [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) (owner: Brouberol) [13:52:40] !log imported cricrl 1.31.1-2 to bookworm-wikimedia - T362408 [13:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:47] T362408: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408 [13:53:18] (CR) Ssingh: [C:+1] "Zone files look good, can't comment on the hosts so leaving that for others :)" [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) (owner: Brouberol) [13:55:06] 
(PS13) JMeybohm: Initial commit of containerd puppet code [puppet] - https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [13:56:48] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T375785#10183335 (Papaul) @Jhancock.wm can you please check this? Maybe while removing the old switches yesterday we bumped into the one of the power cord. thank you. [13:56:56] ops-codfw, SRE, DC-Ops: hw troubleshooting: CPU 1 machine check error for mc2038.codfw.wmnet - https://phabricator.wikimedia.org/T375495#10183336 (Jhancock.wm) Open→Resolved [13:59:54] (PS26) Btullis: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [14:00:07] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10183344 (Papaul) Firewalls are ready, the only thing left is to setup the SSL certificate. I will working with @Jgreen next week to do tha... [14:00:46] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4146/co" [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [14:03:12] SRE-swift-storage, Wikimedia Enterprise: Commonswiki recently updated files not found - https://phabricator.wikimedia.org/T375797#10183346 (prabhat) @MatthewVernon Looking at the imageinfo `timestamp` suggests the files were uploaded or changed on 2024-09-26 ` { "batchcomplete": true, "query":... 
[14:03:18] (PS27) Btullis: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [14:04:09] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4147/co" [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [14:05:00] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T375785#10183358 (Jhancock.wm) Open→Resolved a:Jhancock.wm reseated psu1. alert cleared. [14:08:24] (CR) Muehlenhoff: [C:+2] Send output of daily account check to SRE IF alias only [puppet] - https://gerrit.wikimedia.org/r/1076091 (owner: Muehlenhoff) [14:17:22] (PS28) Btullis: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [14:18:07] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4148/co" [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [14:20:17] (CR) Btullis: [V:+1 C:+2] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn) [14:20:39] (CR) Scott French: "Just learned that @glavagetto@wikimedia.org has a patch that instead goes the route of modernizing `merge_config` (If7bc37b3388b681118332a" [puppet] - https://gerrit.wikimedia.org/r/1076040 (owner: Scott French) [14:25:18] SRE, Infrastructure-Foundations: puppetserver* thrashing and requiring a power cycle as a 
result - https://phabricator.wikimedia.org/T373527#10183390 (CDanis) [14:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 21.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:35:56] (CR) JMeybohm: [C:+1] Compute stable secret checksums by default [deployment-charts] - https://gerrit.wikimedia.org/r/1074099 (owner: Brouberol) [14:38:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 23.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:39:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 24.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:40:00] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 23.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[14:42:01] (CR) Brouberol: [C:+2] Release new base.helper and base.meta modules [deployment-charts] - https://gerrit.wikimedia.org/r/1074098 (owner: Brouberol) [14:42:05] (CR) Brouberol: [C:+2] Compute stable secret checksums by default [deployment-charts] - https://gerrit.wikimedia.org/r/1074099 (owner: Brouberol) [14:43:10] (Merged) jenkins-bot: Release new base.helper and base.meta modules [deployment-charts] - https://gerrit.wikimedia.org/r/1074098 (owner: Brouberol) [14:43:11] (Merged) jenkins-bot: Compute stable secret checksums by default [deployment-charts] - https://gerrit.wikimedia.org/r/1074099 (owner: Brouberol) [14:45:12] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:20] (CR) Scott French: [C:+1] "Thanks for fixing this!"
[puppet] - https://gerrit.wikimedia.org/r/1075153 (owner: Giuseppe Lavagetto) [14:57:30] (PS1) Btullis: Disable exposure warning for airflow webservers by default [deployment-charts] - https://gerrit.wikimedia.org/r/1076213 (https://phabricator.wikimedia.org/T375739) [15:01:23] (PS1) Brouberol: hotfix: only print the checksum at the end of the iteration [deployment-charts] - https://gerrit.wikimedia.org/r/1076214 [15:03:11] (CR) Brouberol: [C:+2] hotfix: only print the checksum at the end of the iteration [deployment-charts] - https://gerrit.wikimedia.org/r/1076214 (owner: Brouberol) [15:03:14] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:55] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10183587 (phaultfinder) [15:13:55] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:19] (PS1) Brouberol: airflow: only change the configuration/secret checksum when actual config changes [deployment-charts] - https://gerrit.wikimedia.org/r/1076217 (https://phabricator.wikimedia.org/T375886) [15:16:41] (CR) Brouberol: [C:+1] Disable exposure warning for airflow webservers by default [deployment-charts] - https://gerrit.wikimedia.org/r/1076213 (https://phabricator.wikimedia.org/T375739) (owner: Btullis) [15:20:14] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10183659 (phaultfinder) [15:22:14] (PS1) Brouberol: cloudnative-pg-cluster: allow the operator to reload a cluster s3
credentials [deployment-charts] - https://gerrit.wikimedia.org/r/1076218 (https://phabricator.wikimedia.org/T375853) [15:24:36] (PS5) Bking: airflow: allow traffic to webserver port from dse-k8s pods [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) [15:24:52] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking) [15:25:16] (PS2) Brouberol: cloudnative-pg-cluster: allow the operator to reload a cluster secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1076218 (https://phabricator.wikimedia.org/T375853) [15:26:59] (CR) Bking: airflow: allow traffic to webserver port from dse-k8s pods (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking) [15:30:25] (PS1) Btullis: Fix the role_contacts for snapshot1015 [puppet] - https://gerrit.wikimedia.org/r/1076219 (https://phabricator.wikimedia.org/T374178) [15:30:55] (CR) Brouberol: "LGTM but there's a small typo in the srange default value" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking) [15:31:11] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4149/co" [puppet] - https://gerrit.wikimedia.org/r/1076219 (https://phabricator.wikimedia.org/T374178) (owner: Btullis) [15:31:15] (CR) Tiziano Fogli: [C:+2] Fix the role_contacts for snapshot1015 [puppet] - https://gerrit.wikimedia.org/r/1076219 (https://phabricator.wikimedia.org/T374178) (owner: Btullis) [15:32:51] (PS6) Bking: airflow: allow traffic to webserver port from dse-k8s pods [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) [15:33:18] (CR) Btullis: [C:+1] "Looks good."
[deployment-charts] - https://gerrit.wikimedia.org/r/1076218 (https://phabricator.wikimedia.org/T375853) (owner: Brouberol) [15:33:42] (PS1) Brouberol: cloudnative-pg-cluster: upsize the WAL storage volume to 15GB by default [deployment-charts] - https://gerrit.wikimedia.org/r/1076220 (https://phabricator.wikimedia.org/T375846) [15:33:47] (PS7) Bking: airflow: allow traffic to webserver port from dse-k8s pods [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) [15:34:28] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking) [15:34:53] (CR) Brouberol: [C:+1] airflow: allow traffic to webserver port from dse-k8s pods (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking) [15:39:05] (PS2) Brouberol: airflow: only change the configuration/secret checksum when actual config changes [deployment-charts] - https://gerrit.wikimedia.org/r/1076217 (https://phabricator.wikimedia.org/T375886) [16:00:06] (CR) Thcipriani: [C:+1] Define $wmgLBFactoryConfigCallback in offline mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1076047 (owner: Dduvall) [16:02:04] just a heads up folks, i'm going to do a noop config deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1076047 to fix our nightly image build [16:02:41] ok [16:02:55] (CR) TrainBranchBot: [C:+2] "Approved by dduvall@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076047 (owner: Dduvall) [16:03:38] (Merged) jenkins-bot: Define $wmgLBFactoryConfigCallback in offline mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1076047 (owner: Dduvall) [16:04:10] !log dduvall@deploy2002 Started scap sync-world: Backport for [[gerrit:1076047|Define $wmgLBFactoryConfigCallback in offline mode]] [16:17:00]
!log dduvall@deploy2002 dduvall: Backport for [[gerrit:1076047|Define $wmgLBFactoryConfigCallback in offline mode]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:18:01] !log dduvall@deploy2002 dduvall: Continuing with sync [16:26:28] !log dduvall@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076047|Define $wmgLBFactoryConfigCallback in offline mode]] (duration: 22m 18s) [16:30:19] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10183875 (phaultfinder) [16:43:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.235s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:47:43] ^ looks like the brief latency bump following a deployment stretches out a bit when all the traffic you have is health checks [16:48:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.235s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:56:09] (CR) Bking: [C:+2] airflow: allow traffic to webserver port from dse-k8s pods [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking) [16:56:25] (CR) Bking: [C:+2] airflow: allow traffic to webserver port from dse-k8s pods (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking) [16:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR
connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:59:57] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:15] ops-codfw, DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T375908 (phaultfinder) NEW [17:41:17] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10184143 (Ladsgroup) started alerting again. Hasn't paged yet. [18:11:29] SRE, Infrastructure-Foundations, Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10184267 (nisrael) Yes, @Reedy she's receiving them at her @wikimedia.org inbox, however the address we're using in our donor-facing emails is lisa@wiki...
[18:20:18] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10184281 (phaultfinder) [19:13:55] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:11] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10184532 (phaultfinder) [19:53:41] hmm [20:01:29] (PS2) NMW03: Update wgMetaNamespace for tlywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1076254 (https://phabricator.wikimedia.org/T367009) [20:01:53] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10184626 (xcollazo) >>! In T368098#10184143, @Ladsgroup wrote: > started alerting again. Hasn't paged yet. Same `enwiki` query? [20:02:26] (CR) NMW03: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076254 (https://phabricator.wikimedia.org/T367009) (owner: NMW03) [20:05:25] (CR) NMW03: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076254 (https://phabricator.wikimedia.org/T367009) (owner: NMW03) [20:28:26] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10184732 (Ladsgroup) I'm not seeing any right now, if it alerts again, I copy paste but I think it's very likely dumps, everything else s...
[20:42:23] (PS1) LeLutin: check_bacula: gracefully handle cases where timestamp values are empty [puppet] - https://gerrit.wikimedia.org/r/1076269 [20:42:23] (PS1) LeLutin: check_bacula: filter lines while ignoring case [puppet] - https://gerrit.wikimedia.org/r/1076270 [20:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:00:12] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:06:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:50:08] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10184902 (phaultfinder) [22:13:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:55] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:23] !log Starting time limited scan on enwiki to catch-up with monthly limit rate -
https://wikitech.wikimedia.org/wiki/MediaModeration [23:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076279 [23:38:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076279 (owner: TrainBranchBot)