[00:32:22] RECOVERY - Check systemd state on mw1438 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T306560)', diff saved to https://phabricator.wikimedia.org/P26326 and previous config saved to /var/cache/conftool/dbconfig/20220424-004315-ladsgroup.json [00:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:21] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [00:58:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P26327 and previous config saved to /var/cache/conftool/dbconfig/20220424-005820-ladsgroup.json [00:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:13:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P26328 and previous config saved to /var/cache/conftool/dbconfig/20220424-011325-ladsgroup.json [01:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T306560)', diff saved to https://phabricator.wikimedia.org/P26329 and previous config saved to /var/cache/conftool/dbconfig/20220424-012830-ladsgroup.json [01:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [01:28:37] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [01:28:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [01:28:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [01:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [01:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:10] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:49:06] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:12] PROBLEM - Disk space on ml-staging-ctrl2001 is CRITICAL: DISK CRITICAL - free space: / 1126 MB (5% inode=95%): /tmp 1126 MB (5% inode=95%): /var/tmp 1126 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [02:19:12] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:24:46] PROBLEM - SSH on furud.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:33:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:46] PROBLEM - Disk space on ml-staging-ctrl2001 is CRITICAL: DISK CRITICAL - free space: / 1127 MB (5% inode=95%): /tmp 1127 MB (5% inode=95%): /var/tmp 1127 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [02:53:40] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:58:06] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:05:06] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:12:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:19:02] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:20:45] !log optimizing flaggedtemplates on plwiki (s2) in db2088 [03:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:58] RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:00:26] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:44] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:51:10] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [05:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:11:32] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:22:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:42] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:32] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:54:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:58] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:00] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [06:29:14] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:33:06] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:00] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:07:48] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:28:40] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:36:02] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [07:37:52] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:40:38] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 19 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [07:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:03:42] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:07:58] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:52] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:46] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:47:22] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:56] RECOVERY - Check systemd state on mw1440 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:04:54] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:13:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:29:14] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:12] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:05:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:56] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:12] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:01] downtimed ml-staging alerts to reduce spam (cluster WIP) [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:45:16] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:32] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:05:26] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:00] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:12] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:35:32] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:49:50] RECOVERY - Check systemd state on mw1338 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:00:58] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:44] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:24:02] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 22 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:24:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:48] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:37:58] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:48] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:54:06] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:55:14] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:14:12] RECOVERY - Check systemd state on mw1445 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:58] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:42] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:52:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:12] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [13:56:32] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [14:13:00] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:39:17] https://en.wikibooks.org/wiki/Wikibooks:Reading_room/Administrative_Assistance#Unable_to_edit_page_-_%3A_World_Stamp_Catalogue%2FIndia [14:39:28] I try to remove a duplicate comment, and Firefox chokes [14:39:30] RECOVERY - Check systemd state on mw1446 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:38] "The page isn’t redirecting properly [14:39:38] Firefox has detected that the server is redirecting the request for this address in a way that will never complete. [14:39:38] This problem can sometimes be caused by disabling or refusing to accept cookies." [14:39:54] I turned OFF anything that could be blocking [14:40:08] What's actually gone wrong? [14:45:54] ShakespeareFan00: is this only in Firefox [14:46:01] * RhinosF1 can not reproduce in chrome [14:46:06] When did it start? [14:46:48] This is in Firefox Nightly [14:46:55] And it only started happening today [14:47:22] And did your Firefox update last night ShakespeareFan00 [14:47:59] I updated it this morning [14:48:19] You think it's a firefox glitch? [14:48:23] ShakespeareFan00: can you please report to Firefox then [14:48:41] Yep, we don't deploy on a weekend [14:48:42] I will consider doing so [14:48:47] And I can't see it on chrome [14:50:14] I can't either [14:57:44] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:04:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:02] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:50] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:48:06] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:20] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:52:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:55:04] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:56:58] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:21:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:21:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:27:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:31:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [16:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26330 and previous config saved to /var/cache/conftool/dbconfig/20220424-163151-ladsgroup.json [16:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:32:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:32:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [16:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [16:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:00] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [16:34:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:36:20] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [16:37:02] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:39:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 97 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:39:06] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 62 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:45:38] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [16:47:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:47:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T306560)', diff saved to https://phabricator.wikimedia.org/P26331 and previous config saved to /var/cache/conftool/dbconfig/20220424-164748-ladsgroup.json [16:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:53] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:50:10] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:54:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T306560)', diff saved to https://phabricator.wikimedia.org/P26332 and previous config saved to /var/cache/conftool/dbconfig/20220424-165439-ladsgroup.json [16:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:45] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:59:58] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P26333 and previous config saved to /var/cache/conftool/dbconfig/20220424-170945-ladsgroup.json [17:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 146 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26334 and previous config saved to /var/cache/conftool/dbconfig/20220424-172015-ladsgroup.json [17:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:21:34] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 65 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:23:44] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P26335 and previous config saved to /var/cache/conftool/dbconfig/20220424-172450-ladsgroup.json [17:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:58] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26336 and previous config saved to /var/cache/conftool/dbconfig/20220424-173520-ladsgroup.json [17:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:34] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T306560)', diff saved to https://phabricator.wikimedia.org/P26337 and previous config saved to /var/cache/conftool/dbconfig/20220424-173955-ladsgroup.json [17:39:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:39:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:26] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:45:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:45:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T306560)', diff saved to https://phabricator.wikimedia.org/P26338 and previous config saved to /var/cache/conftool/dbconfig/20220424-174553-ladsgroup.json [17:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:57] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:50:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26339 and previous config saved to /var/cache/conftool/dbconfig/20220424-175025-ladsgroup.json [17:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T306560)', diff saved to https://phabricator.wikimedia.org/P26340 and previous config saved to /var/cache/conftool/dbconfig/20220424-175250-ladsgroup.json [17:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:54] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:00:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [18:00:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [18:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26341 and previous config saved to /var/cache/conftool/dbconfig/20220424-180013-ladsgroup.json [18:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:04:16] PROBLEM - IPMI Sensor Status on restbase1018 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26342 and previous config saved to /var/cache/conftool/dbconfig/20220424-180530-ladsgroup.json [18:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:05:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [18:05:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [18:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26343 and previous config saved to /var/cache/conftool/dbconfig/20220424-180555-ladsgroup.json [18:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:02] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P26344 and previous config saved to /var/cache/conftool/dbconfig/20220424-180755-ladsgroup.json [18:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:00] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26345 and previous config saved to /var/cache/conftool/dbconfig/20220424-181201-ladsgroup.json [18:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:23:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P26346 and previous config saved to /var/cache/conftool/dbconfig/20220424-182300-ladsgroup.json [18:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:16] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26347 and previous config saved to /var/cache/conftool/dbconfig/20220424-182707-ladsgroup.json [18:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T306560)', diff saved to https://phabricator.wikimedia.org/P26348 and previous config saved to /var/cache/conftool/dbconfig/20220424-183805-ladsgroup.json [18:38:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [18:38:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [18:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:12] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:38:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T306560)', diff saved to https://phabricator.wikimedia.org/P26349 and previous config saved to /var/cache/conftool/dbconfig/20220424-183813-ladsgroup.json [18:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26350 and previous config saved to /var/cache/conftool/dbconfig/20220424-184212-ladsgroup.json [18:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T306560)', diff saved to https://phabricator.wikimedia.org/P26351 and previous config saved to /var/cache/conftool/dbconfig/20220424-184503-ladsgroup.json [18:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:08] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:46:38] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:12] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26352 and previous config saved to /var/cache/conftool/dbconfig/20220424-185056-ladsgroup.json [18:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26353 and previous config saved to /var/cache/conftool/dbconfig/20220424-185717-ladsgroup.json [18:57:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [18:57:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [18:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26354 and previous config saved to /var/cache/conftool/dbconfig/20220424-185724-ladsgroup.json [18:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P26355 and previous config saved to /var/cache/conftool/dbconfig/20220424-190008-ladsgroup.json [19:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26356 and previous config saved to /var/cache/conftool/dbconfig/20220424-190135-ladsgroup.json [19:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26357 and previous config saved to /var/cache/conftool/dbconfig/20220424-190601-ladsgroup.json [19:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:00] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P26358 and previous config saved to /var/cache/conftool/dbconfig/20220424-191515-ladsgroup.json [19:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26359 and previous config saved to /var/cache/conftool/dbconfig/20220424-191641-ladsgroup.json [19:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26360 and previous config saved to /var/cache/conftool/dbconfig/20220424-192106-ladsgroup.json [19:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:58] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T306560)', diff saved to https://phabricator.wikimedia.org/P26361 and previous config saved to /var/cache/conftool/dbconfig/20220424-193020-ladsgroup.json [19:30:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:30:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:26] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:30:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P26362 and previous config saved to /var/cache/conftool/dbconfig/20220424-193028-ladsgroup.json [19:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26363 and previous config saved to /var/cache/conftool/dbconfig/20220424-193146-ladsgroup.json [19:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26364 and previous config saved to /var/cache/conftool/dbconfig/20220424-193611-ladsgroup.json [19:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:36:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [19:36:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [19:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26365 and previous config saved to /var/cache/conftool/dbconfig/20220424-193635-ladsgroup.json [19:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26366 and previous config saved to /var/cache/conftool/dbconfig/20220424-194651-ladsgroup.json [19:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:46:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [19:47:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [19:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26367 and previous config saved to /var/cache/conftool/dbconfig/20220424-194705-ladsgroup.json [19:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:38] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:50:40] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [19:51:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26368 and previous config saved to /var/cache/conftool/dbconfig/20220424-195115-ladsgroup.json [19:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:02] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [19:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:03:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26369 and previous config saved to /var/cache/conftool/dbconfig/20220424-200620-ladsgroup.json [20:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:00] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:00] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26370 and previous config saved to /var/cache/conftool/dbconfig/20220424-202006-ladsgroup.json [20:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:21:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26371 and previous config saved to /var/cache/conftool/dbconfig/20220424-202125-ladsgroup.json [20:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P26372 and previous config saved to /var/cache/conftool/dbconfig/20220424-203042-ladsgroup.json [20:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:47] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:35:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26373 and previous config saved to /var/cache/conftool/dbconfig/20220424-203511-ladsgroup.json [20:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26374 and previous config saved to /var/cache/conftool/dbconfig/20220424-203630-ladsgroup.json [20:36:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [20:36:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [20:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26375 and previous config saved to /var/cache/conftool/dbconfig/20220424-203639-ladsgroup.json [20:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P26376 and previous config saved to /var/cache/conftool/dbconfig/20220424-204547-ladsgroup.json [20:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26377 and previous config saved to /var/cache/conftool/dbconfig/20220424-204910-ladsgroup.json [20:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:15] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:50:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26378 and previous config saved to /var/cache/conftool/dbconfig/20220424-205016-ladsgroup.json [20:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:52] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:00:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P26379 and previous config saved to /var/cache/conftool/dbconfig/20220424-210052-ladsgroup.json [21:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26380 and previous config saved to /var/cache/conftool/dbconfig/20220424-210415-ladsgroup.json [21:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26381 and previous config saved to /var/cache/conftool/dbconfig/20220424-210521-ladsgroup.json [21:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:13:31] PROBLEM - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:15:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:43] RECOVERY - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 400 bytes in 1.323 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:15:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P26382 and previous config saved to /var/cache/conftool/dbconfig/20220424-211559-ladsgroup.json [21:16:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [21:16:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [21:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:05] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:16:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P26383 and previous config saved to /var/cache/conftool/dbconfig/20220424-211607-ladsgroup.json [21:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:59] still, eh :/ [21:18:50] wait no, that's jobrunners not videoscalers, something new might be up [21:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26384 and previous config saved to /var/cache/conftool/dbconfig/20220424-211920-ladsgroup.json [21:19:20] any other SREs looking? [21:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:02] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26385 and previous config saved to /var/cache/conftool/dbconfig/20220424-213425-ladsgroup.json [21:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:34:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:34:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26386 and previous config saved to /var/cache/conftool/dbconfig/20220424-213440-ladsgroup.json [21:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26387 and previous config saved to /var/cache/conftool/dbconfig/20220424-214717-ladsgroup.json [21:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:02:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26388 and previous config saved to /var/cache/conftool/dbconfig/20220424-220222-ladsgroup.json [22:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:08] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P26389 and previous config saved to /var/cache/conftool/dbconfig/20220424-221621-ladsgroup.json [22:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:26] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:17:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26390 and previous config saved to /var/cache/conftool/dbconfig/20220424-221727-ladsgroup.json [22:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:08] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:06] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26391 and previous config saved to /var/cache/conftool/dbconfig/20220424-223126-ladsgroup.json [22:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26392 and previous config saved to /var/cache/conftool/dbconfig/20220424-223232-ladsgroup.json [22:32:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [22:32:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [22:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26393 and previous config saved to /var/cache/conftool/dbconfig/20220424-223240-ladsgroup.json [22:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26394 and previous config saved to /var/cache/conftool/dbconfig/20220424-223650-ladsgroup.json [22:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:48] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:04] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:04] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26395 and previous config saved to /var/cache/conftool/dbconfig/20220424-224631-ladsgroup.json [22:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:58] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26396 and previous config saved to /var/cache/conftool/dbconfig/20220424-225155-ladsgroup.json [22:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:56] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P26397 and previous config saved to /var/cache/conftool/dbconfig/20220424-230136-ladsgroup.json [23:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:42] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26398 and previous config saved to /var/cache/conftool/dbconfig/20220424-230700-ladsgroup.json [23:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:30] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26399 and previous config saved to /var/cache/conftool/dbconfig/20220424-232205-ladsgroup.json [23:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:22:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [23:22:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [23:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26400 and previous config saved to /var/cache/conftool/dbconfig/20220424-232219-ladsgroup.json [23:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:12] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:26:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26401 and previous config saved to /var/cache/conftool/dbconfig/20220424-232629-ladsgroup.json [23:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:04] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:58] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:52] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:54] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:41:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26402 and previous config saved to /var/cache/conftool/dbconfig/20220424-234134-ladsgroup.json [23:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:00] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:02] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at reading authorization packet, system error: 104 Connection reset by peer https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:44:00] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:46:18] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:47:38] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:56:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26403 and previous config saved to /var/cache/conftool/dbconfig/20220424-235639-ladsgroup.json [23:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown