[00:11:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26404 and previous config saved to /var/cache/conftool/dbconfig/20220425-001144-ladsgroup.json [00:11:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [00:11:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [00:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26405 and previous config saved to /var/cache/conftool/dbconfig/20220425-001152-ladsgroup.json [00:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:10] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:04] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P26406 and previous config saved to /var/cache/conftool/dbconfig/20220425-002422-ladsgroup.json [00:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:32:50] (PS1) Andrew Bogott: ceph: allow ceph user to access smart utils [puppet] - https://gerrit.wikimedia.org/r/785390 [00:34:29] (PS2) Andrew Bogott: ceph: allow ceph user to access smart utils [puppet] - https://gerrit.wikimedia.org/r/785390 [00:36:28] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:37:29] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1003/34951/" [puppet] - https://gerrit.wikimedia.org/r/785390 (owner: Andrew Bogott) [00:38:50] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:39:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26407 and previous config saved to /var/cache/conftool/dbconfig/20220425-003927-ladsgroup.json [00:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:44] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:48:04] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 22 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:54:33] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26408 and previous config saved to /var/cache/conftool/dbconfig/20220425-005432-ladsgroup.json [00:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:58] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26409 and previous config saved to /var/cache/conftool/dbconfig/20220425-010938-ladsgroup.json [01:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:43] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:09:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:09:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26410 and previous config saved to /var/cache/conftool/dbconfig/20220425-010952-ladsgroup.json [01:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:58] RECOVERY - 
Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:13:59] (PS1) Andrew Bogott: Prepare cloudweb2002-dev to replace cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/785391 (https://phabricator.wikimedia.org/T304881) [01:16:00] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:17:04] (CR) Andrew Bogott: [C: +2] Prepare cloudweb2002-dev to replace cloudweb2001-dev [puppet] - https://gerrit.wikimedia.org/r/785391 (https://phabricator.wikimedia.org/T304881) (owner: Andrew Bogott) [01:20:36] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:33:00] PROBLEM - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:39:17] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: update new codfw1dev host [01:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:39] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: update new codfw1dev host [01:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:33] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: update new codfw1dev host (duration: 00m 54s) [01:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[01:41:02] ACKNOWLEDGEMENT - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott preparing this host for decom https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:12] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26411 and previous config saved to /var/cache/conftool/dbconfig/20220425-014457-ladsgroup.json [01:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:45:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:48:54] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:50:38] RECOVERY - Memcached on cloudweb2002-dev is OK: TCP OK - 0.032 
second response time on 208.80.153.41 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [01:52:50] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26412 and previous config saved to /var/cache/conftool/dbconfig/20220425-020002-ladsgroup.json [02:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:12] RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:40] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:15:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26413 and previous config saved to /var/cache/conftool/dbconfig/20220425-021507-ladsgroup.json [02:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:20] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:25:22] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P26414 and previous config saved to /var/cache/conftool/dbconfig/20220425-023012-ladsgroup.json [02:30:14] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [02:30:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [02:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26415 and previous config saved to /var/cache/conftool/dbconfig/20220425-023020-ladsgroup.json [02:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:26] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26416 and previous config saved to /var/cache/conftool/dbconfig/20220425-023429-ladsgroup.json [02:34:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:46] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:35:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:37:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:41:32] (PS1) Andrew Bogott: Update grants to reflect replacement of cloudweb2001 with cloudweb2002 [puppet] - https://gerrit.wikimedia.org/r/785395 [02:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26417 and previous config saved to /var/cache/conftool/dbconfig/20220425-024934-ladsgroup.json [02:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:02:38] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:04:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26418 and previous config saved to /var/cache/conftool/dbconfig/20220425-030439-ladsgroup.json [03:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:18] 
(PS1) Andrew Bogott: labtestwiki: update test lab server [mediawiki-config] - https://gerrit.wikimedia.org/r/785399 (https://phabricator.wikimedia.org/T304881) [03:18:56] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26419 and previous config saved to /var/cache/conftool/dbconfig/20220425-031944-ladsgroup.json [03:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:19:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [03:19:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [03:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26420 and previous config saved to /var/cache/conftool/dbconfig/20220425-031959-ladsgroup.json [03:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:18] (PS2) Andrew Bogott: labtestwiki: update labtest ldap server [mediawiki-config] - https://gerrit.wikimedia.org/r/785399 (https://phabricator.wikimedia.org/T304881) [03:23:22] PROBLEM - WDQS SPARQL on wdqs1013 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26421 and previous config saved to /var/cache/conftool/dbconfig/20220425-032410-ladsgroup.json [03:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:58] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:25:32] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.095 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:50] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:16] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:39:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26422 and previous config saved to /var/cache/conftool/dbconfig/20220425-033916-ladsgroup.json [03:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:28] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:24] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:54:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26423 and previous config saved to /var/cache/conftool/dbconfig/20220425-035421-ladsgroup.json [03:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:09:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26424 and previous config saved to /var/cache/conftool/dbconfig/20220425-040926-ladsgroup.json [04:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:09:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [04:09:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [04:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26425 and previous config saved to /var/cache/conftool/dbconfig/20220425-040940-ladsgroup.json [04:09:44] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26426 and previous config saved to /var/cache/conftool/dbconfig/20220425-041347-ladsgroup.json [04:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26427 and previous config saved to /var/cache/conftool/dbconfig/20220425-042852-ladsgroup.json [04:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26428 and previous config saved to /var/cache/conftool/dbconfig/20220425-044357-ladsgroup.json [04:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P26429 and previous config saved to /var/cache/conftool/dbconfig/20220425-045902-ladsgroup.json [04:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:07:14] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:34] (PS1) Marostegui: Revert "db2088: Disable notifications" [puppet] - 
https://gerrit.wikimedia.org/r/785355 [05:11:13] (CR) Marostegui: [C: +2] Revert "db2088: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/785355 (owner: Marostegui) [05:20:30] (PS1) Marostegui: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/785602 (https://phabricator.wikimedia.org/T306417) [05:20:57] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/785602 (https://phabricator.wikimedia.org/T306417) (owner: Marostegui) [05:21:40] (PS1) Marostegui: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/785603 (https://phabricator.wikimedia.org/T306417) [05:22:07] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/785603 (https://phabricator.wikimedia.org/T306417) (owner: Marostegui) [05:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:50:24] (PS1) Marostegui: db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/785604 [05:51:05] (CR) Marostegui: [C: +2] db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/785604 (owner: Marostegui) [06:01:44] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:06] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:11:44] !log depooling and restarting blazegraph on wdqs1007 (deadlocked for 4+days) [06:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:46] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.106 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:14:08] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [06:16:58] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:46] (CR) Ladsgroup: [C: +1] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/785603 (https://phabricator.wikimedia.org/T306417) (owner: Marostegui) [06:33:24] (CR) Ladsgroup: [C: +1] mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/785602 (https://phabricator.wikimedia.org/T306417) (owner: Marostegui) [06:34:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:34:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26430 and previous config saved to /var/cache/conftool/dbconfig/20220425-063409-ladsgroup.json [06:34:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:14] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [06:36:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26431 and previous config saved to /var/cache/conftool/dbconfig/20220425-063634-ladsgroup.json [06:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:40] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:38:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1132 into s1 T301879', diff saved to https://phabricator.wikimedia.org/P26432 and previous config saved to /var/cache/conftool/dbconfig/20220425-063823-marostegui.json [06:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:28] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [06:51:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26433 and previous config saved to /var/cache/conftool/dbconfig/20220425-065139-ladsgroup.json [06:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1001.wikimedia.org [06:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [06:55:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1179.eqiad.wmnet with reason: Maintenance [06:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P26434 and previous config saved to /var/cache/conftool/dbconfig/20220425-065559-ladsgroup.json [06:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:04] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [06:58:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1001.wikimedia.org [06:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:58] jouncebot: nowandnext [06:59:58] For the next 0 hour(s) and 0 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220424T0700) [06:59:58] In 0 hour(s) and 0 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T0700) [07:00:05] Amir1, awight, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T0700). [07:00:05] irc-koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:44] hi? 
[07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:13] I'm reading the tickets [07:02:22] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:02:54] not sure the first one can be done. T44473 is still open [07:02:54] T44473: Allow upload-by-URL from upload.wikimedia.org - https://phabricator.wikimedia.org/T44473 [07:03:06] (CR) Ladsgroup: [C: -1] "T44473 is still open." [mediawiki-config] - https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) (owner: Stang) [07:03:21] (CR) Ladsgroup: [C: +2] Add tothemoon.ser.asu.edu to the wgCopyUploadsDomains allowlist of commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/785326 (https://phabricator.wikimedia.org/T306671) (owner: Stang) [07:03:40] I see T303577 got a merge last week so [07:03:41] T303577: "uploader" group for viwiki - https://phabricator.wikimedia.org/T303577 [07:03:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P26435 and previous config saved to /var/cache/conftool/dbconfig/20220425-070348-ladsgroup.json [07:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:53] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [07:04:06] (Merged) jenkins-bot: Add tothemoon.ser.asu.edu to the wgCopyUploadsDomains allowlist of commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/785326 (https://phabricator.wikimedia.org/T306671) (owner: Stang) [07:04:57] 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/785208 [07:05:39] hmm, I mean https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/784717 [07:05:55] (03PS2) 10Ladsgroup: ActorMigration: Start reading from rev_actor field in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784685 (https://phabricator.wikimedia.org/T275246) [07:06:20] I think that was also a mistake [07:06:21] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:785326|Add tothemoon.ser.asu.edu to the wgCopyUploadsDomains allowlist of commonswiki (T306671)]] (duration: 00m 52s) [07:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:26] T306671: Please add tothemoon.ser.asu.edu to the wgCopyUploadsDomains allowlist - https://phabricator.wikimedia.org/T306671 [07:06:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26436 and previous config saved to /var/cache/conftool/dbconfig/20220425-070644-ladsgroup.json [07:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:58] (03CR) 10Ladsgroup: [C: 03+2] ActorMigration: Start reading from rev_actor field in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784685 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [07:07:37] koi: the allowlist one is done, I don't think we can do the second one [07:08:10] (03Merged) 10jenkins-bot: ActorMigration: Start reading from rev_actor field in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784685 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [07:08:26] well, fine, I would wait for that ticket :( [07:08:35] anyway thanks [07:08:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2004.codfw.wmnet [07:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:06] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:11:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:13] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:784685|ActorMigration: Start reading from rev_actor field in group0 (T275246)]] (duration: 00m 50s) [07:11:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:18] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [07:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2004.codfw.wmnet [07:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:05] (03PS1) 10Marostegui: pc2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/785794 (https://phabricator.wikimedia.org/T306777) [07:15:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2005.codfw.wmnet [07:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:23] (03CR) 10Marostegui: [C: 03+2] pc2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/785794 (https://phabricator.wikimedia.org/T306777) (owner: 10Marostegui) [07:18:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to 
https://phabricator.wikimedia.org/P26437 and previous config saved to /var/cache/conftool/dbconfig/20220425-071853-ladsgroup.json [07:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2005.codfw.wmnet [07:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:21:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26438 and previous config saved to /var/cache/conftool/dbconfig/20220425-072149-ladsgroup.json [07:21:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:21:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:54] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [07:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:21:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26439 and previous config saved to /var/cache/conftool/dbconfig/20220425-072157-ladsgroup.json [07:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [07:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26440 and previous config saved to /var/cache/conftool/dbconfig/20220425-072323-ladsgroup.json [07:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:26:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:58] (03CR) 10David Caro: [C: 03+2] Add debian 11 testing support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778477 (owner: 10David Caro) [07:31:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host seaborgium.wikimedia.org [07:31:32] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:48] PROBLEM - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The following units failed: ifup@eth0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P26441 and previous config saved to /var/cache/conftool/dbconfig/20220425-073359-ladsgroup.json [07:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:56] (03PS1) 10Marostegui: pc2014: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/785796 (https://phabricator.wikimedia.org/T306777) [07:38:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P26442 and previous config saved to /var/cache/conftool/dbconfig/20220425-073828-ladsgroup.json [07:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:36] (03CR) 10Marostegui: [C: 03+2] pc2014: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/785796 (https://phabricator.wikimedia.org/T306777) (owner: 10Marostegui) [07:43:19] (03PS1) 10Marostegui: parsercache.my.cnf: Leave rowid disabled [puppet] - 10https://gerrit.wikimedia.org/r/785798 (https://phabricator.wikimedia.org/T306777) [07:44:23] (03CR) 10Marostegui: [C: 03+2] parsercache.my.cnf: Leave rowid disabled [puppet] - 10https://gerrit.wikimedia.org/r/785798 (https://phabricator.wikimedia.org/T306777) (owner: 10Marostegui) [07:44:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1001.wikimedia.org [07:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P26443 and previous config saved to /var/cache/conftool/dbconfig/20220425-074904-ladsgroup.json [07:49:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [07:49:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [07:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:09] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [07:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P26444 and previous config saved to /var/cache/conftool/dbconfig/20220425-074912-ladsgroup.json [07:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader1001.eqiad.wmnet [07:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1001.wikimedia.org [07:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1144.eqiad.wmnet with reason: Rebooting for T303174 [07:50:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1144.eqiad.wmnet with reason: Rebooting for T303174 
[07:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:46] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3314 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P26445 and previous config saved to /var/cache/conftool/dbconfig/20220425-075045-kormat.json [07:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:07] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3315 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P26446 and previous config saved to /var/cache/conftool/dbconfig/20220425-075106-kormat.json [07:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader1001.eqiad.wmnet [07:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader2001.codfw.wmnet [07:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:45] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P26447 and previous config saved to /var/cache/conftool/dbconfig/20220425-075333-ladsgroup.json [07:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:13] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:57:40] !log 
kormat@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26448 and previous config saved to /var/cache/conftool/dbconfig/20220425-075740-kormat.json [07:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P26449 and previous config saved to /var/cache/conftool/dbconfig/20220425-075801-ladsgroup.json [07:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:07] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [07:58:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader2001.codfw.wmnet [07:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:06:05] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:31] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:08:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26450 and previous config saved to /var/cache/conftool/dbconfig/20220425-080838-ladsgroup.json [08:08:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
dbstore1007.eqiad.wmnet with reason: Maintenance [08:08:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:43] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [08:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:09:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26451 and previous config saved to /var/cache/conftool/dbconfig/20220425-080910-ladsgroup.json [08:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:31] (03CR) 10David Caro: [C: 04-1] "That file is added by ceph-osd:" [puppet] - 10https://gerrit.wikimedia.org/r/785390 (owner: 10Andrew Bogott) [08:11:05] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26452 and previous config saved to /var/cache/conftool/dbconfig/20220425-081135-ladsgroup.json 
[08:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:44] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26453 and previous config saved to /var/cache/conftool/dbconfig/20220425-081244-kormat.json [08:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26454 and previous config saved to /var/cache/conftool/dbconfig/20220425-081307-ladsgroup.json [08:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:01] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:51] (03PS1) 10Muehlenhoff: Point irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/785803 [08:19:24] (03CR) 10David Caro: [C: 03+1] P:openstack::rabbitmq: fix file permissions [puppet] - 10https://gerrit.wikimedia.org/r/785105 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [08:20:03] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:23] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to 
https://phabricator.wikimedia.org/P26455 and previous config saved to /var/cache/conftool/dbconfig/20220425-082640-ladsgroup.json [08:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:48] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26456 and previous config saved to /var/cache/conftool/dbconfig/20220425-082747-kormat.json [08:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:59] (03CR) 10David Caro: [C: 03+1] kubeadm: remove metrics files [puppet] - 10https://gerrit.wikimedia.org/r/783418 (owner: 10Majavah) [08:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26457 and previous config saved to /var/cache/conftool/dbconfig/20220425-082813-ladsgroup.json [08:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:33] (03CR) 10David Caro: [C: 03+2] P:openstack::rabbitmq: fix file permissions [puppet] - 10https://gerrit.wikimedia.org/r/785105 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [08:31:17] (03CR) 10David Caro: [C: 03+2] kubeadm: remove metrics files [puppet] - 10https://gerrit.wikimedia.org/r/783418 (owner: 10Majavah) [08:32:07] (03CR) 10Muehlenhoff: [C: 03+2] Point irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/785803 (owner: 10Muehlenhoff) [08:32:31] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:41:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P26458 and previous config saved to /var/cache/conftool/dbconfig/20220425-084145-ladsgroup.json [08:41:49] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:52] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26459 and previous config saved to /var/cache/conftool/dbconfig/20220425-084251-kormat.json [08:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:56] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26460 and previous config saved to /var/cache/conftool/dbconfig/20220425-084256-kormat.json [08:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P26461 and previous config saved to /var/cache/conftool/dbconfig/20220425-084318-ladsgroup.json [08:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:22] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [08:44:03] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:11] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 9 hosts with reason: Rebooting db1154 T303174 [08:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: Rebooting db1154 T303174 [08:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1154.eqiad.wmnet with reason: Rebooting for T303174 [08:48:26] !log kormat@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1154.eqiad.wmnet with reason: Rebooting for T303174 [08:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:57] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [08:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-master1002.eqiad.wmnet [08:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:00] !log restart varnish and ats on cp2037 to clear daemon restart alerts [08:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:30] sigh.. 
cp2032 actually [08:55:35] E_COFFEE /o\ [08:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26462 and previous config saved to /var/cache/conftool/dbconfig/20220425-085650-ladsgroup.json [08:56:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:56:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:56] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [08:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:57:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [08:57:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [08:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [08:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:51] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [08:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:00] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26463 and previous config saved to /var/cache/conftool/dbconfig/20220425-085759-kormat.json [08:58:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [08:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:58:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T306560)', diff saved to https://phabricator.wikimedia.org/P26464 and previous config saved to /var/cache/conftool/dbconfig/20220425-085822-ladsgroup.json [08:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:07] RECOVERY - traffic_server backend process restarted on cp2032 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2032&var-layer=backend [08:59:07] RECOVERY - Varnish frontend child restarted on cp2032 is OK: (C)2 ge (W)2 ge 1 
https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp2032&var-datasource=codfw+prometheus/ops [08:59:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1002.eqiad.wmnet [08:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T306560)', diff saved to https://phabricator.wikimedia.org/P26465 and previous config saved to /var/cache/conftool/dbconfig/20220425-090037-ladsgroup.json [09:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:20] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) That's promising progress! I have rebooted dumpsdata1007 back into Linux 5.10. This is the standard kernel we're running on Bullseye (we can ignore Buster/Stretch for testing the PERC... 
[09:06:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1155.eqiad.wmnet with reason: Rebooting for T303174 [09:06:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1155.eqiad.wmnet with reason: Rebooting for T303174 [09:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:49] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Do not use 'facebook' in campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785245 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [09:13:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26466 and previous config saved to /var/cache/conftool/dbconfig/20220425-091303-kormat.json [09:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:03] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:11] (03PS1) 10Majavah: hieradata: enable tls on eqiad1 rabbitmq too [puppet] - 10https://gerrit.wikimedia.org/r/785809 (https://phabricator.wikimedia.org/T297268) [09:15:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P26467 and previous config saved to /var/cache/conftool/dbconfig/20220425-091542-ladsgroup.json [09:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34952/console" [puppet] - 10https://gerrit.wikimedia.org/r/785809 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [09:16:42] !log 
jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2412.codfw.wmnet [09:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:59] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:01] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:23:45] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2412.codfw.wmnet [09:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:23] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2413.codfw.wmnet [09:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/775282 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:28:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26468 and previous config saved to /var/cache/conftool/dbconfig/20220425-092807-kormat.json [09:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P26469 and previous config saved to /var/cache/conftool/dbconfig/20220425-093047-ladsgroup.json [09:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:55] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2413.codfw.wmnet [09:30:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:56] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2414.codfw.wmnet [09:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:11] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2414.codfw.wmnet [09:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:59] ^ fyi: all of those hosts mw2412-mw2419 are insetup and not pooled in any cluster [09:37:11] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2415.codfw.wmnet [09:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/776929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:41:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2004.codfw.wmnet [09:41:46] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:44] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2415.codfw.wmnet [09:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:20] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2416.codfw.wmnet [09:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T306560)', diff saved to https://phabricator.wikimedia.org/P26470 and previous config saved to /var/cache/conftool/dbconfig/20220425-094552-ladsgroup.json [09:45:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance 
[09:45:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [09:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:57] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [09:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:46:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26471 and previous config saved to /var/cache/conftool/dbconfig/20220425-094600-ladsgroup.json [09:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:11] I'm seeing slow chunked uploads to Commons: ~750kB JPEG uploads in ~2 seconds, then takes almost 5 min. to be reassembled and published. [09:46:53] (03PS2) 10Jbond: admin: fix a few indentation issues [puppet] - 10https://gerrit.wikimedia.org/r/778498 (owner: 10Zabe) [09:47:11] Swift having a bad day? Network issue between proxies and jobrunners? Jobrunners choking on something? [09:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:48:07] xover: could you share the response headers for such upload? 
stripping your IP if nedeed of course [09:48:14] *needed [09:49:13] This is using chunked uploading using Rijlke's script (com:BigChunkedUpload.js) so I don't have protocol level data. [09:50:23] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2416.codfw.wmnet [09:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:48] AIUI it's using the API, and it reports when the upload (client to proxy) is done and starts waiting for the file to be reassembled server side. That latter is what sits waiting for ~5 min. [09:51:34] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2417.codfw.wmnet [09:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:42] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [09:55:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1137.eqiad.wmnet with reason: Rebooting for T303174 [09:55:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1137.eqiad.wmnet with reason: Rebooting for T303174 [09:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:44] !log kormat@cumin1001 dbctl commit (dc=all): 'db1137 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P26472 and previous config saved to /var/cache/conftool/dbconfig/20220425-095543-kormat.json [09:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:02] Previous incident with roughly similar symptoms was https://wikitech.wikimedia.org/wiki/Incidents/2021-11-04_large_file_upload_timeouts but the symptoms this time differ in the details. 
[09:56:16] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2417.codfw.wmnet [09:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:41] xover: hmm but you mentioned 750kB uploads [09:56:47] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2418.codfw.wmnet [09:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:45] vgutierrez: Yes, I routinely upload >100MB files and like to manually specify the file desc templates so I just use Rijlke's script rotinely. [09:57:55] *routinely [10:00:27] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:02:30] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2418.codfw.wmnet [10:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:55] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw2419.codfw.wmnet [10:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:04:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:37] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2419.codfw.wmnet [10:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:14:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2004.codfw.wmnet [10:14:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host flowspec1001.eqiad.wmnet [10:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flowspec1001.eqiad.wmnet [10:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:11] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:24:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:45] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1137.eqiad.wmnet with reason: Rebooting for T303174 [10:24:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1137.eqiad.wmnet with reason: Rebooting for T303174 [10:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:41] (03CR) 10Jbond: [C: 03+2] "thanks for the patch will merge" [puppet] - 10https://gerrit.wikimedia.org/r/778498 (owner: 10Zabe) [10:28:03] !log kormat@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26474 and previous config saved to /var/cache/conftool/dbconfig/20220425-102803-kormat.json [10:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:13] !log jmm@cumin2002 
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2004.codfw.wmnet [10:38:12] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:07] !log kormat@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26475 and previous config saved to /var/cache/conftool/dbconfig/20220425-104307-kormat.json [10:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:03] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:45:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26476 and previous config saved to /var/cache/conftool/dbconfig/20220425-104614-ladsgroup.json [10:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:19] T306560: Fix nullability of img_major_mime and 
oi_major_mime - https://phabricator.wikimedia.org/T306560 [10:50:55] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:09] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:54:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2004.codfw.wmnet [10:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:32] (03CR) 10David Caro: [C: 03+2] hieradata: enable tls on eqiad1 rabbitmq too [puppet] - 10https://gerrit.wikimedia.org/r/785809 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [10:58:11] !log kormat@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26477 and previous config saved to /var/cache/conftool/dbconfig/20220425-105811-kormat.json [10:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-master1001.eqiad.wmnet [11:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26478 and previous config saved to /var/cache/conftool/dbconfig/20220425-110119-ladsgroup.json [11:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:39] 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10User-jbond: Update offboard-user script to use Keystone API - https://phabricator.wikimedia.org/T306788 (10jbond) p:05Triage→03Medium [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:16] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10User-jbond: Update offboard-user script to use Keystone API - https://phabricator.wikimedia.org/T306788 (10Majavah) [11:02:56] (03CR) 10Jbond: "thanks, q inline" [puppet] - 10https://gerrit.wikimedia.org/r/783190 (owner: 10Majavah) [11:07:48] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for ulogd2 [puppet] - 10https://gerrit.wikimedia.org/r/775282 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:11:40] (03CR) 10Muehlenhoff: [C: 03+2] Add mdadm processes to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/776929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:11:41] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-master1001.eqiad.wmnet [11:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26479 and previous config saved to /var/cache/conftool/dbconfig/20220425-111315-kormat.json [11:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:03] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26480 and previous config saved to /var/cache/conftool/dbconfig/20220425-111625-ladsgroup.json [11:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:17:51] (03CR) 10Majavah: openldap: ldap is no longer authoritative for keystone projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/783190 (owner: 10Majavah) [11:18:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:20:18] !log failover Ganeti master in codfw-test to ganeti-test2003 T306499 [11:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:22] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [11:20:57] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jbond) >>! In T211750#7853874, @Volans wrote: >>>! In T211750#7853334, @jhathaway wrote: >> Our we ready to consider running black on our pu... 
[11:24:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [11:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:41] PROBLEM - ganeti-wconfd running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:29:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [11:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:03] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26481 and previous config saved to /var/cache/conftool/dbconfig/20220425-113130-ladsgroup.json [11:31:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:31:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:37] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [11:31:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T306560)', diff saved to https://phabricator.wikimedia.org/P26482 and previous config saved to /var/cache/conftool/dbconfig/20220425-113138-ladsgroup.json [11:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10MoritzMuehlenhoff) >>! In T211750#7876514, @jbond wrote: > We already have [[ https://github.com/wikimedia/puppet/blob/production/rake_modul... [11:36:57] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:38] (03CR) 10Jbond: admin: Add placeholder to reserve uid and gid 914 for minio-user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [11:40:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T306560)', diff saved to https://phabricator.wikimedia.org/P26483 and previous config saved to /var/cache/conftool/dbconfig/20220425-114003-ladsgroup.json [11:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:08] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [11:44:38] 10SRE-OnFire (FY2021/2022-Q4), 10observability, 10SRE Observability (FY2021/2022-Q4): Make 'status page' dashboard the default dashboard in Grafana - https://phabricator.wikimedia.org/T305954 (10lmata) [11:55:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P26484 and previous config saved to /var/cache/conftool/dbconfig/20220425-115508-ladsgroup.json [11:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/785274 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [11:57:18] 
(03PS1) 10Majavah: openstack::cinder: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/785840 (https://phabricator.wikimedia.org/T297268) [11:58:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet [11:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34953/console" [puppet] - 10https://gerrit.wikimedia.org/r/785840 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [12:02:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet [12:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:12] 10SRE-tools, 10Infrastructure-Foundations: Diffscan: host off-infra - https://phabricator.wikimedia.org/T265595 (10lmata) wikitech-static may be moving off rackspace: T304688 [12:06:57] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:08:52] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10observability, 10I18n: Internationalization (i18n) & localization (l10n) of www.wikimediastatus.net - https://phabricator.wikimedia.org/T305896 (10lmata) [12:10:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P26485 and previous config saved to /var/cache/conftool/dbconfig/20220425-121013-ladsgroup.json [12:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:55] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10Sascha) Will Toolforge and Cloud VPS jobs be able to read and write into their own custom buckets? (That would be super helpful.) 
[12:24:05] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:05] (03CR) 10Jbond: openldap: ldap is no longer authoritative for keystone projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/783190 (owner: 10Majavah) [12:25:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T306560)', diff saved to https://phabricator.wikimedia.org/P26487 and previous config saved to /var/cache/conftool/dbconfig/20220425-122518-ladsgroup.json [12:25:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:25:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:25:24] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [12:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T306560)', diff saved to https://phabricator.wikimedia.org/P26488 and previous config saved to /var/cache/conftool/dbconfig/20220425-122531-ladsgroup.json [12:25:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jbond) >>! In T211750#7876563, @MoritzMuehlenhoff wrote: >>>! In T211750#7876514, @jbond wrote: >> We already have [[ https://github.com/wik... [12:28:29] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10lmata) thank you all for your feedback and input! [12:28:56] (03PS3) 10Majavah: openldap: ldap is no longer authoritative for keystone projects [puppet] - 10https://gerrit.wikimedia.org/r/783190 [12:28:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T306560)', diff saved to https://phabricator.wikimedia.org/P26489 and previous config saved to /var/cache/conftool/dbconfig/20220425-122856-ladsgroup.json [12:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:19] (03CR) 10Majavah: openldap: ldap is no longer authoritative for keystone projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/783190 (owner: 10Majavah) [12:29:35] (03CR) 10jerkins-bot: [V: 04-1] openldap: ldap is no longer authoritative for keystone projects [puppet] - 10https://gerrit.wikimedia.org/r/783190 (owner: 10Majavah) [12:30:09] (03PS4) 10Majavah: openldap: ldap is no longer authoritative for keystone projects [puppet] - 10https://gerrit.wikimedia.org/r/783190 [12:33:58] (03CR) 10Jbond: [C: 03+2] "thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/783190 (owner: 10Majavah) [12:42:27] (03PS1) 10Jbond: P:systemd::timesyncd: move nrpe script to usr/local [puppet] - 10https://gerrit.wikimedia.org/r/785843 [12:43:37] (03CR) 10Jbond: standard::ntp: move standard ntp to its own profile 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [12:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P26490 and previous config saved to /var/cache/conftool/dbconfig/20220425-124401-ladsgroup.json [12:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:27] (03CR) 10jerkins-bot: [V: 04-1] P:systemd::timesyncd: move nrpe script to usr/local [puppet] - 10https://gerrit.wikimedia.org/r/785843 (owner: 10Jbond) [12:49:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:52] (03PS2) 10Jbond: P:systemd::timesyncd: move nrpe script to usr/local [puppet] - 10https://gerrit.wikimedia.org/r/785843 [12:52:36] (03CR) 10jerkins-bot: [V: 04-1] P:systemd::timesyncd: move nrpe script to usr/local [puppet] - 10https://gerrit.wikimedia.org/r/785843 (owner: 10Jbond) [12:56:21] (03CR) 10Krinkle: [C: 03+2] Stop writing to $wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781058 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:57:19] (03CR) 10Krinkle: [C: 03+2] "Applied on the server. Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/781058 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:58:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2001.wikimedia.org [12:58:10] !log krinkle@deploy1002 Synchronized private/PrivateSettings.php: If4d7ea24b4db59 (duration: 00m 59s) [12:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:16] (03Merged) 10jenkins-bot: Stop writing to $wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781058 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:23] (03PS3) 10Jbond: P:systemd::timesyncd: move nrpe script to usr/local [puppet] - 10https://gerrit.wikimedia.org/r/785843 [12:58:40] (03CR) 10Krinkle: Improve support for realms other than production and labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [12:59:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P26491 and previous config saved to /var/cache/conftool/dbconfig/20220425-125906-ladsgroup.json [12:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T1300). [13:00:04] tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2001.wikimedia.org [13:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:14] I can deploy today! [13:00:24] tgr: hello! [13:00:29] o/ [13:00:56] the patch is only testable on beta, so no mwdebug needed [13:01:08] okay [13:01:14] (03PS2) 10Urbanecm: GrowthExperiments: Do not use 'facebook' in campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785245 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [13:01:16] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Do not use 'facebook' in campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785245 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [13:01:21] I'll just sync it in that case [13:02:02] (03CR) 10Urbanecm: [C: 03+2] GlobalUserSelectQueryBuilder: Do not fatal when no users are returned [extensions/CentralAuth] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785207 (https://phabricator.wikimedia.org/T306535) (owner: 10Urbanecm) [13:02:14] (03Merged) 10jenkins-bot: GrowthExperiments: Do not use 'facebook' in campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785245 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [13:02:21] I'll also sync my backport to unbreak mentor dashboard [13:02:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:02:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:02:34] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:26] Sorry if it bothered you, but what should I do to ask for merging the patchset in mediawiki/core ? [13:04:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0338c9bb9ff6ad388270d1e2df5013d8a49d1210: GrowthExperiments: Do not use facebook in campaign pattern (T303785) (duration: 00m 51s) [13:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:44] Or where should I ask for help? [13:05:13] tgr: synced. anything else? [13:05:15] (03Merged) 10jenkins-bot: GlobalUserSelectQueryBuilder: Do not fatal when no users are returned [extensions/CentralAuth] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785207 (https://phabricator.wikimedia.org/T306535) (owner: 10Urbanecm) [13:05:24] urbanecm: that was all, thanks! [13:05:30] no problem [13:05:37] Winston_Sung[m]: are you looking for code review? [13:05:39] T303785: Account creation: social media landing pages - https://phabricator.wikimedia.org/T303785 [13:05:50] Yes. [13:06:01] (03CR) 10Jbond: [C: 03+2] P:systemd::timesyncd: move nrpe script to usr/local [puppet] - 10https://gerrit.wikimedia.org/r/785843 (owner: 10Jbond) [13:06:08] hello Winston_Sung[m], sorry to hear you experience issues with finding a reviewer. https://www.mediawiki.org/wiki/Gerrit/Code_review/Getting_reviews has a lot of useful tips that apply in mediawiki/core [13:06:47] #wikimedia-dev is probably the best channel for that. 
[13:07:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:07:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:33] backport works. syncing. [13:12:06] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.8/extensions/CentralAuth/includes/User/GlobalUserSelectQueryBuilder.php: c4c4c3219ad69705c83caa50754d95285c96f352: GlobalUserSelectQueryBuilder: Do not fatal when no users are returned (T306535) (duration: 00m 54s) [13:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:12] * urbanecm done [13:12:23] !log UTC afternoon B&C window done [13:12:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:12:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:13:11] T306535: Mentor dashboard stopped updating - https://phabricator.wikimedia.org/T306535 [13:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:58] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: add a check to ensure service has been restarted 
[puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [13:14:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T306560)', diff saved to https://phabricator.wikimedia.org/P26492 and previous config saved to /var/cache/conftool/dbconfig/20220425-131411-ladsgroup.json [13:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:11] (03PS2) 10Herron: sre.kafka.reboot-workers: remove systemctl stop calls [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) [13:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:40] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [13:17:21] PROBLEM - puppet last run on ml-staging-ctrl2002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:18:20] (03PS1) 10Jbond: P:sre: handlke missing manager [puppet] - 10https://gerrit.wikimedia.org/r/785851 [13:20:21] "hello Winston_Sung, sorry to..." <- Thanks. [13:21:18] (03PS3) 10Krinkle: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [13:21:30] (03CR) 10Krinkle: [C: 03+1] Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [13:21:59] (03CR) 10jerkins-bot: [V: 04-1] Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [13:23:34] (03CR) 10Herron: [C: 03+2] "Thanks all!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [13:24:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet [13:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:44] (03CR) 10Jbond: [C: 03+2] P:sre: handlke missing manager [puppet] - 10https://gerrit.wikimedia.org/r/785851 (owner: 10Jbond) [13:26:29] (03PS4) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) [13:27:52] (03Merged) 10jenkins-bot: sre.kafka.reboot-workers: remove systemctl stop calls [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [13:28:39] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [13:28:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet [13:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:14] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [13:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:46] !log maintenance (rolling reboot) on api_appserver in codfw (cookbook sre.hosts.reboot-cluster -D codfw -c api_appserver --percentage 5 --grace_sleep 60) [13:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:09] (03PS4) 10Krinkle: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [13:35:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet [13:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:35:27] (03CR) 10Krinkle: [C: 03+1] "Now with passing tests :) Sorry." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [13:35:42] (03CR) 10Herron: [C: 04-2] "These points all make sense. I'll abandon this for now as it seems unnecessary. We can always revisit in the future if needed" [puppet] - 10https://gerrit.wikimedia.org/r/779086 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [13:36:02] (03Abandoned) 10Herron: kafka-mirror: startup after kafka.service, shutdown before kafka.service [puppet] - 10https://gerrit.wikimedia.org/r/779086 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [13:36:25] (03CR) 10jerkins-bot: [V: 04-1] Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [13:39:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet [13:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1010.eqiad.wmnet [13:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:46] (03PS5) 10Krinkle: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [13:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:51:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1010.eqiad.wmnet [13:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1011.eqiad.wmnet [14:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:26] 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10BTullis) cc @Milimetric and @Ottomata who probably know the most about the current behaviour regarding `wprov` and mobile vs desktop view recording. [14:10:08] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1011.eqiad.wmnet [14:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:10] (03CR) 10Krinkle: [C: 03+1] Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [14:13:57] !log mw2253: remove puppet lock of stuck puppet run due to reboot, run-puppet-agent [14:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:06] (03CR) 10Dzahn: "Let me take this and merge it later today." 
[puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [14:20:22] (03CR) 10Jelto: site: use appserver in codfw C3, cleanup duplicate insetup definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [14:21:39] !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-codfw cluster: Reboot kafka nodes [14:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:56] (03CR) 10JMeybohm: [C: 03+1] "PCC is a bit weird as is shows some k8s nodes and masters with changes although they show as "no change" in details." [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [14:26:37] (03CR) 10WMDE-Fisch: [C: 03+1] [beta] Stash all logs for the Kartographer extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [14:30:42] (03CR) 10JMeybohm: [C: 04-1] add a namespace for new service image-suggestions (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [14:31:06] (03CR) 10JMeybohm: [C: 04-1] add a namespace for new service image-suggestions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [14:36:01] (03PS8) 10Krinkle: List Kartographer static map exemptions and document+flip default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) [14:36:04] (03CR) 10Krinkle: [C: 03+2] List Kartographer static map exemptions and document+flip default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: 10Krinkle) [14:37:09] (03Merged) 10jenkins-bot: List Kartographer static map exemptions and document+flip default [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) (owner: 10Krinkle) [14:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:43:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:44:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:44:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] (03CR) 10Jcrespo: admin: Add placeholder to reserve uid and gid 914 for minio-user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [14:45:51] (03PS1) 10Ottomata: apt distributions - Add thirdparty/confluent componenet to bullsye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/785866 (https://phabricator.wikimedia.org/T304433) [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:18] (03CR) 10Ottomata: [C: 03+2] apt distributions - Add thirdparty/confluent componenet to bullsye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/785866 (https://phabricator.wikimedia.org/T304433) (owner: 10Ottomata) [14:49:30] 
PROBLEM - Host mw2286 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:21] ^ thats me with the sre.hosts.reboot-cluster cookbook. Host is not coming up properly [14:51:46] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:32] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I22240af06d (duration: 01m 42s) [14:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:03] jelto: while the cookbook is trying to connect to the host you can connect to the mgmt in parallel at root@mw2286.mgmt.codfw.wmnet and then powercycle it via "racadm server action powercycle" [14:54:42] happens sometimes when the regular IPMI command sent by the cookbook doesn't trigger the reboot reliably [14:56:51] (03CR) 10Muehlenhoff: docker: ensure apparmor package is installed if on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [14:59:16] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:47] ^ checking [15:01:30] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:42] (03PS1) 10Jbond: P:mail: also exclude posfix aliases from vtr router [puppet] - 10https://gerrit.wikimedia.org/r/785870 [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:03:42] (03PS2) 10Jbond: P:mail: also exclude posfix aliases from vtr router [puppet] 
- 10https://gerrit.wikimedia.org/r/785870 [15:04:37] (03PS3) 10Jbond: P:mail: also exclude posfix aliases from vtr router [puppet] - 10https://gerrit.wikimedia.org/r/785870 [15:06:28] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] [beta] Stash all logs for the Kartographer extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [15:06:40] (03CR) 10jerkins-bot: [V: 04-1] P:mail: also exclude posfix aliases from vtr router [puppet] - 10https://gerrit.wikimedia.org/r/785870 (owner: 10Jbond) [15:09:51] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test sync - jbond@cumin1001" [15:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:08] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:10:41] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test sync - jbond@cumin1001" [15:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:04] (03PS2) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [15:11:50] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host pki2001.codfw.wmnet [15:11:51] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host pki2001.codfw.wmnet [15:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:27] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [15:12:43] (03PS1) 10Jbond: pki: update primary server [dns] - 10https://gerrit.wikimedia.org/r/785873 [15:12:55] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
pki2001.codfw.wmnet [15:12:57] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host pki2001.codfw.wmnet [15:13:14] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host pki2001.codfw.wmnet [15:13:14] (03CR) 10jerkins-bot: [V: 04-1] pki: update primary server [dns] - 10https://gerrit.wikimedia.org/r/785873 (owner: 10Jbond) [15:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host pki-root1001.eqiad.wmnet [15:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:34] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki2001.codfw.wmnet [15:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1001.eqiad.wmnet [15:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:06] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [15:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:16] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:30] (03CR) 10Jbond: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/785873 (owner: 10Jbond) [15:24:54] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [15:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:27:17] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks Timo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [15:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T1530). [15:30:44] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10netbox: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10jbond) p:05Triage→03Medium [15:30:52] (03CR) 10Jbond: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/785873 (owner: 10Jbond) [15:32:56] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-codfw cluster: Reboot kafka nodes [15:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:13] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [15:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:08] !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-eqiad cluster: Reboot kafka nodes [15:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:16] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:55] (03CR) 10Jbond: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/785873 (owner: 10Jbond) [15:44:09] (03CR) 10Jbond: [C: 03+2] pki: update primary server [dns] - 10https://gerrit.wikimedia.org/r/785873 (owner: 10Jbond) [15:46:05] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet on all recursors [15:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:08] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet on all recursors [15:46:13] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:30] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host pki1001.eqiad.wmnet [15:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:43] (03PS1) 10Jbond: Revert "pki: update primary server" [dns] - 10https://gerrit.wikimedia.org/r/785364 [15:54:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki1001.eqiad.wmnet [15:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:20] (03CR) 10Jbond: [C: 03+2] Revert "pki: update primary server" [dns] - 10https://gerrit.wikimedia.org/r/785364 (owner: 10Jbond) [15:56:24] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet on all recursors [15:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet on all recursors [15:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:29] (03CR) 10Jbond: "no probs 😊" [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [16:00:45] (03PS4) 10Jbond: P:mail: also exclude posfix aliases from vtr router [puppet] - 10https://gerrit.wikimedia.org/r/785870 [16:08:59] (03CR) 10JHathaway: [C: 03+1] "minor comments, but looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/785870 (owner: 10Jbond) [16:11:16] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:16:58] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:45] (03PS3) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [16:37:34] (03PS3) 10Andrew Bogott: ceph: allow ceph user to access smart utils [puppet] - 10https://gerrit.wikimedia.org/r/785390 [16:42:06] (03PS1) 10Andrew Bogott: Prepare cloudweb2001-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/785886 (https://phabricator.wikimedia.org/T304881) [16:44:20] (03PS1) 10Andrew Bogott: Prepare cloudcephmon200[2-3]-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/785887 (https://phabricator.wikimedia.org/T304881) [16:46:31] (03CR) 10Herron: [C: 03+1] prometheus: remove per-exporter up checks [puppet] - 10https://gerrit.wikimedia.org/r/784636 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [16:46:54] (03Abandoned) 10Andrew Bogott: ceph::mon: actually install mon and mgr packages on mon nodes [puppet] - 10https://gerrit.wikimedia.org/r/784767 (owner: 10Andrew Bogott) [16:47:02] (03CR) 10Herron: [C: 03+1] thanos: aggregate exporter 'up' metrics [puppet] - 10https://gerrit.wikimedia.org/r/784635 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [16:47:03] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (10lmata) a:03jcrespo [16:47:07] 10SRE-OnFire (FY2021/2022-Q4): incidents occurring during Q4 have been scored with the scorecard - https://phabricator.wikimedia.org/T306511 (10lmata) a:03jcrespo [16:47:12] (03CR) 10Andrew Bogott: [C: 03+2] Prepare cloudweb2001-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/785886 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [16:47:27] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) a:03jcrespo [16:47:55] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-eqiad cluster: Reboot kafka 
nodes [16:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:11] (03CR) 10Herron: [C: 03+1] prometheus: migrate prometheus_directorysize cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:49:23] (03CR) 10David Caro: [C: 03+1] ceph: allow ceph user to access smart utils [puppet] - 10https://gerrit.wikimedia.org/r/785390 (owner: 10Andrew Bogott) [16:49:40] (03CR) 10Herron: [C: 03+1] logstash: populate target index format and add pipeline diagnostics [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [16:50:17] (03CR) 10Herron: [C: 03+1] logstash: rewrite ecs settings [puppet] - 10https://gerrit.wikimedia.org/r/777887 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite) [16:50:39] (03CR) 10Herron: [C: 03+1] logstash: transform rotation frequency values to datestamp format [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [16:51:02] (03CR) 10Herron: [C: 03+1] logstash: set partition on legacy indexes [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [16:51:26] (03PS4) 10Andrew Bogott: ceph: allow ceph user to access smart utils on mon nodes [puppet] - 10https://gerrit.wikimedia.org/r/785390 [16:51:48] (03CR) 10Andrew Bogott: [C: 03+2] Prepare cloudcephmon200[2-3]-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/785887 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [16:52:12] (03PS2) 10Andrew Bogott: Prepare cloudcephmon200[2-3]-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/785887 (https://phabricator.wikimedia.org/T304881) [16:53:40] (03CR) 10Andrew Bogott: [C: 03+2] ceph: allow ceph user to access smart utils on mon nodes [puppet] - 10https://gerrit.wikimedia.org/r/785390 (owner: 10Andrew Bogott) [16:55:40] 
(03CR) 10BryanDavis: [C: 03+1] labtestwiki: update labtest ldap server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785399 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [16:56:28] 10SRE, 10SRE Observability: sre.kafka.reboot-workers fails on logging cluster with failed to execute command 'systemctl stop kafka-mirror': - https://phabricator.wikimedia.org/T305652 (10herron) 05Open→03Resolved a:03herron A round of kafka-logging rolling reboots was completed today using sre.kafka.rebo... [16:57:33] (03PS1) 10Reedy: wikitech.php: Minor cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785889 [17:00:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:05] ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T1700). 
[17:04:46] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM otrs1001.eqiad.wmnet [17:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:52] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:22] (03PS1) 10Andrew Bogott: Remove references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785893 (https://phabricator.wikimedia.org/T304881) [17:07:57] (03CR) 10jerkins-bot: [V: 04-1] Remove references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785893 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:09:21] (03PS2) 10Andrew Bogott: Remove references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785893 (https://phabricator.wikimedia.org/T304881) [17:11:20] (03CR) 10jerkins-bot: [V: 04-1] Remove references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785893 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:12:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:12:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26493 and previous config saved to /var/cache/conftool/dbconfig/20220425-171223-ladsgroup.json [17:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:38] T306560: Fix nullability of 
img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:14:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26494 and previous config saved to /var/cache/conftool/dbconfig/20220425-171441-ladsgroup.json [17:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:39] (03PS3) 10Andrew Bogott: Remove references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785893 (https://phabricator.wikimedia.org/T304881) [17:17:25] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM otrs1001.eqiad.wmnet [17:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:38] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:54] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785893 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:18:23] (03PS1) 10Herron: sre.elasticsearch add logging clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/785895 (https://phabricator.wikimedia.org/T255864) [17:21:23] (03PS1) 10Andrew Bogott: Remove more references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785897 (https://phabricator.wikimedia.org/T304881) [17:22:07] (03CR) 10jerkins-bot: [V: 04-1] sre.elasticsearch add logging clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/785895 (https://phabricator.wikimedia.org/T255864) (owner: 10Herron) [17:23:58] (03PS2) 10Herron: sre.elasticsearch add logging clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/785895 (https://phabricator.wikimedia.org/T255864) [17:24:08] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for 
hosts cloudservices[2002-2003]-dev.wikimedia.org [17:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:29] (03CR) 10Andrew Bogott: [C: 03+2] Remove more references to cloudcephosd200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785897 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:28:10] (03CR) 10Herron: [C: 03+2] sre.elasticsearch add logging clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/785895 (https://phabricator.wikimedia.org/T255864) (owner: 10Herron) [17:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26495 and previous config saved to /var/cache/conftool/dbconfig/20220425-172946-ladsgroup.json [17:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:26] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [17:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:52] (03CR) 10Ladsgroup: [C: 03+1] Remove meaningless restriction level "none" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE)) [17:33:51] (03PS1) 10Andrew Bogott: Remove puppet for cloudservices200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785900 (https://phabricator.wikimedia.org/T306669) [17:34:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:36:36] (03Merged) 10jenkins-bot: sre.elasticsearch add logging clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/785895 (https://phabricator.wikimedia.org/T255864) (owner: 10Herron) [17:36:39] PROBLEM - 
Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:36:50] (03PS1) 10Ssingh: P:wikidough: update notes_url to point to new monitoring page [puppet] - 10https://gerrit.wikimedia.org/r/785901 [17:37:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34955/console" [puppet] - 10https://gerrit.wikimedia.org/r/785901 (owner: 10Ssingh) [17:39:03] (03CR) 10Andrew Bogott: [C: 03+2] Remove puppet for cloudservices200[2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/785900 (https://phabricator.wikimedia.org/T306669) (owner: 10Andrew Bogott) [17:39:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudservices[2002-2003]-dev.wikimedia.org [17:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom cloudservices200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306669 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudservices[2002-2003]-dev.wikimed... 
[17:40:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 234, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:01] (CR) Ssingh: [V: +1 C: +2] P:wikidough: update notes_url to point to new monitoring page [puppet] - https://gerrit.wikimedia.org/r/785901 (owner: Ssingh) [17:41:34] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Decom cloudservices200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306669 (Andrew) a:Andrew→Papaul I've run sre.hosts.decommission so this should be ready for deracking &c. [17:44:23] (PS1) Ottomata: profile::docker::engine - Allow not setting version [puppet] - https://gerrit.wikimedia.org/r/785904 (https://phabricator.wikimedia.org/T304433) [17:44:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26496 and previous config saved to /var/cache/conftool/dbconfig/20220425-174451-ladsgroup.json [17:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:56] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (aaron) >>! In T212129#7865521, @Krinkle wrote: > Next: Decide on how and whether to fragment the data in mainstas... 
[17:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:46:36] (CR) Ottomata: [C: +2] profile::docker::engine - Allow not setting version [puppet] - https://gerrit.wikimedia.org/r/785904 (https://phabricator.wikimedia.org/T304433) (owner: Ottomata) [17:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:55:07] (PS1) Ottomata: profile::docker::engine - make $version Optional [puppet] - https://gerrit.wikimedia.org/r/785905 (https://phabricator.wikimedia.org/T304433) [17:55:23] SRE, Scap, Python3-Porting, Release-Engineering-Team (Priority Backlog 📥): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (dancy) @thcipriani Based on reading about git-lfs and git-fat (including outstanding issues on GitHub), I'm in favor of migrating to git-l... 
[17:58:03] (CR) Ottomata: [C: +2] profile::docker::engine - make $version Optional [puppet] - https://gerrit.wikimedia.org/r/785905 (https://phabricator.wikimedia.org/T304433) (owner: Ottomata) [17:58:26] SRE: role_contacts (service owners) as a custom puppet fact - https://phabricator.wikimedia.org/T306830 (Dzahn) [17:59:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26497 and previous config saved to /var/cache/conftool/dbconfig/20220425-175957-ladsgroup.json [18:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:03] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:02:04] (PS1) Ottomata: docker init.pp - make $version Optional [puppet] - https://gerrit.wikimedia.org/r/785907 [18:03:01] SRE, Scap, Python3-Porting, Release-Engineering-Team (Priority Backlog 📥): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (Dzahn) T214229 - scap3 + git-fat results in git status with permissions errors T202100 - Intermittent git-fat failure during deploy T147... 
[18:03:58] (03CR) 10Ottomata: [C: 03+2] docker init.pp - make $version Optional [puppet] - 10https://gerrit.wikimedia.org/r/785907 (owner: 10Ottomata) [18:04:02] (03PS1) 10Bernard Wang: Enable TOC for all users opted into modern Vector outside of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785908 [18:04:35] (03PS2) 10Bernard Wang: Enable TOC for all users opted into modern Vector outside of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785908 (https://phabricator.wikimedia.org/T306608) [18:05:16] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:08:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:24] PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:30] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:11:40] (03CR) 10Nray: [C: 03+1] Enable TOC for all users opted into modern Vector outside of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785908 (https://phabricator.wikimedia.org/T306608) (owner: 10Bernard Wang) [18:13:20] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:37] (03PS1) 10Ottomata: docker engine - pick smart default for package name [puppet] - 
10https://gerrit.wikimedia.org/r/785910 (https://phabricator.wikimedia.org/T304433) [18:19:04] (03CR) 10Ottomata: [C: 03+2] docker engine - pick smart default for package name [puppet] - 10https://gerrit.wikimedia.org/r/785910 (https://phabricator.wikimedia.org/T304433) (owner: 10Ottomata) [18:19:21] (03CR) 10MewOphaswongse: Newcomer tasks: deploy AND topic selection to pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [18:20:00] (03CR) 10Ottomata: "Just saw this. I have just merged some patches that makes default $version = undef" [puppet] - 10https://gerrit.wikimedia.org/r/670840 (owner: 10Muehlenhoff) [18:32:11] RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:09] (03CR) 10Dzahn: [C: 03+2] site: use appserver in codfw C3, cleanup duplicate insetup definition [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [18:38:19] (03PS4) 10Dzahn: site: use appserver in codfw C3, cleanup duplicate insetup definition [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [18:38:53] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:39:15] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Priority Backlog 📥): git-fat 
needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10thcipriani) >>! In T279509#7877992, @dancy wrote: > I can help on the scap side. I haven't touched archiva yet. There is support in scap... [18:52:25] jouncebot nowandnext [18:52:25] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [18:52:25] In 1 hour(s) and 7 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T2000) [18:56:26] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Priority Backlog 📥): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10Ottomata) > deploy directly from Gerrit ...say more :) The jar binaries are built by maven-release-plugin in a jenkins job and then uplo... [18:58:40] (03PS1) 10Gergő Tisza: Add Link: Add 'excluded sections' task setting [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785926 (https://phabricator.wikimedia.org/T304150) [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:56] (03CR) 10Dzahn: "noop confirmed on mw1451,mw1452,mw1453..." 
[puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [19:09:09] !log install grafana-plugins 0.5 and restart grafana on grafana1002 T304583 [19:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:13] !log turning mw2412 through mw2419 into actual appservers - applying roles for the first time, will cause alerts probably [19:09:15] T304583: Add meta data from the run to the page drill down pages - https://phabricator.wikimedia.org/T304583 [19:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:30] (03CR) 10Cwhite: [C: 03+2] grafana: provision JSON datasource [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [19:12:27] jouncebot nowandnext [19:12:27] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [19:12:27] In 0 hour(s) and 47 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T2000) [19:13:01] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:40] (03CR) 10Ahmon Dancy: [C: 03+2] Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [19:14:33] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:19:29] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - 
https://phabricator.wikimedia.org/T306835 (10RobH) [19:22:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10RobH) [19:22:30] (03PS6) 10Ahmon Dancy: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 [19:23:04] (03CR) 10Ahmon Dancy: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [19:23:10] (03CR) 10Ahmon Dancy: [C: 03+2] Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [19:23:51] (03Merged) 10jenkins-bot: Improve support for realms other than production and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781060 (owner: 10Ahmon Dancy) [19:26:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:26:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:26:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:56] !log dancy@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:781060|Improve support for realms other than production and labs]] (duration: 01m 42s) [19:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:52] !log dancy@deploy1002 Synchronized 
wmf-config/CommonSettings.php: Config: [[gerrit:781060|Improve support for realms other than production and labs]] (duration: 01m 43s) [19:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:31] !log dancy@deploy1002 Started scap: Config: [[gerrit:781060|Improve support for realms other than production and labs]] [19:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:25] PROBLEM - Memcached on mw2412 is CRITICAL: connect to address 10.192.32.59 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [19:33:36] !log dancy@deploy1002 Started scap: Config: [[gerrit:781060|Improve support for realms other than production and labs]] [19:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:29] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2412.codfw.wmnet with reason: fresh role user [19:34:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2412.codfw.wmnet with reason: fresh role user [19:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:49] !log rebooting mw2412, mw2413 [19:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:15] RECOVERY - Memcached on mw2412 is OK: TCP OK - 0.033 second response time on 10.192.32.59 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [19:46:30] !log dancy@deploy1002 Finished scap: Config: [[gerrit:781060|Improve support for realms other than production and labs]] (duration: 12m 54s) [19:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:53] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Priority Backlog 📥): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10thcipriani) >>! 
In T279509#7878185, @Ottomata wrote: >> deploy directly from Gerrit > > ...say more :) > > The jar binaries are built by... [19:52:24] PROBLEM - Host mw2418 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:59] SRE, Scap, Python3-Porting, Release-Engineering-Team (Priority Backlog 📥): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (Ottomata) Can .jar .gitattributes be managed by git-lfs to download from Archiva API directly? E.g. this URL: http://archiva.wikimedia.org... [19:53:06] !log rebooting mw2414 through mw2419 [19:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:18] PROBLEM - Host mw2417 is DOWN: PING CRITICAL - Packet loss = 100% [19:53:50] RECOVERY - Host mw2418 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [19:54:52] RECOVERY - Host mw2417 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:57:16] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 7 hosts with reason: fresh role user [19:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:27] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: fresh role user [19:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RoanKattouw, Urbanecm, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T2000). [20:00:04] andrewbogott and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:25] hello! I'm here for the window [20:01:04] I can do the deploy today, but I need a minute to finish assembling lunch [20:01:22] I'm here! 
My patch is extremely trivial [20:03:20] bwang: if you want to wait on Roan that's fine with me [20:06:21] yep good with me! [20:10:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:24] OK I'm here now [20:10:51] (03PS3) 10Catrope: Enable TOC for all users opted into modern Vector outside of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785908 (https://phabricator.wikimedia.org/T306608) (owner: 10Bernard Wang) [20:10:55] (03CR) 10Catrope: [C: 03+2] Enable TOC for all users opted into modern Vector outside of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785908 (https://phabricator.wikimedia.org/T306608) (owner: 10Bernard Wang) [20:11:43] (03Merged) 10jenkins-bot: Enable TOC for all users opted into modern Vector outside of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785908 (https://phabricator.wikimedia.org/T306608) (owner: 10Bernard Wang) [20:14:27] bwang: Your patch is on mwdebug1002, please test [20:14:36] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:36] got it! 
[20:14:41] (03PS3) 10Catrope: labtestwiki: update labtest ldap server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785399 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [20:14:51] (03CR) 10Catrope: [C: 03+2] labtestwiki: update labtest ldap server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785399 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [20:15:34] (03Merged) 10jenkins-bot: labtestwiki: update labtest ldap server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785399 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [20:17:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:17:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:17:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:54] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:17] (03PS1) 10Dzahn: conftool-date: add mw2412 through mw2419 as new appservers [puppet] - 10https://gerrit.wikimedia.org/r/785918 (https://phabricator.wikimedia.org/T290192) [20:21:06] Ok Roan my patch on mwdebug1002 looks good! 
[20:22:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:22:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:16] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:785908|Enable TOC for all users opted into modern Vector outside of pilot wikis (T306608)]] (duration: 01m 40s) [20:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:21] T306608: Deploy ToC to all users opted into the new skin outside pilot wikis - https://phabricator.wikimedia.org/T306608 [20:23:37] bwang: OK yours is deployed [20:24:44] andrewbogott: Yours is on mwdebug1002 for testing, but I'm not sure if you can test it there since it's for labtestwiki. If not, I can just deploy it [20:25:01] RoanKattouw: basically if it didn't break mwdebug visibly then we're good. [20:25:26] It definitely won't change behavior there unless something very bad is happening like a misplaced ; [20:25:37] sweet thx, it looks good on prod [20:25:47] mwdebug looks OK, so deploying [20:25:59] thanks RoanKattouw ! 
[20:27:44] !log catrope@deploy1002 Synchronized wmf-config/wikitech.php: Config: [[gerrit:785399|labtestwiki: update labtest ldap server (T304881)]] (duration: 01m 39s) [20:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:52] T304881: Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 [20:28:13] andrewbogott: Alright, it's live now, please test on labtestwiki in production :) [20:28:42] (03PS2) 10Dzahn: add a namespace for new service image-suggestions [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) [20:29:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:10] (03CR) 10Dzahn: add a namespace for new service image-suggestions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [20:29:35] RoanKattouw: I can log out and in on both wikitech and labtestwikitech which means ldap integration is working. [20:29:39] thanks for the merge [20:29:47] (03PS3) 10Dzahn: add a namespace for new service image-suggestions [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) [20:30:00] (03PS4) 10Dzahn: add a namespace for new service image-suggestions [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) [20:31:33] OK great, then I think this backport window is done! 
[20:34:56] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:40] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:31] 10ops-codfw, 10DC-Ops: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 (10RhinosF1) [20:48:55] RoanKattouw: are you still around? [20:49:01] wondering if I can add a late change to the window [20:49:14] (config only) [20:51:21] (03PS1) 10Jdlrobson: [Web scroll] Restore original sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785919 (https://phabricator.wikimedia.org/T305442) [20:55:02] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:29] 10ops-codfw, 10DC-Ops: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 (10Dzahn) 20:27 mw2286 timed out during my deployments BTW , is that a known issue? 
^ to avoid these we would also have to remove them from scap ("dsh") groups by setting them to "pooled=inactive" [20:57:36] PROBLEM - Memcached on mw2415 is CRITICAL: connect to address 10.192.32.62 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [20:58:00] !log rebooting mw2415 [20:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:43] (03PS1) 10Cwhite: profile: re-enable grafana db sync post 8.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/785927 [20:58:47] Jdlrobson: Yeah I'm here, sorry for the late response [20:59:10] !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: dc=codfw,name=mw2286.codfw.wmnet [20:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:22] RoanKattouw: ^ this should have made the difference for that one host [21:00:04] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T2100). [21:00:12] PROBLEM - Host mw2415 is DOWN: PING CRITICAL - Packet loss = 100% [21:01:56] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-debian-version-textfile.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:12] RECOVERY - Memcached on mw2415 is OK: TCP OK - 0.033 second response time on 10.192.32.62 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [21:02:14] RECOVERY - Host mw2415 is UP: PING OK - Packet loss = 0%, RTA = 33.32 ms [21:03:11] 10ops-codfw, 10DC-Ops: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 (10Dzahn) dcops: The host is depooled, you can work on this any time you want to. 
[21:03:24] PROBLEM - mediawiki-installation DSH group on mw2413 is CRITICAL: Host mw2413 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:03:24] PROBLEM - mediawiki-installation DSH group on mw2414 is CRITICAL: Host mw2414 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:03:24] PROBLEM - mediawiki-installation DSH group on mw2417 is CRITICAL: Host mw2417 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:03:24] PROBLEM - mediawiki-installation DSH group on mw2416 is CRITICAL: Host mw2416 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:03:26] PROBLEM - mediawiki-installation DSH group on mw2419 is CRITICAL: Host mw2419 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:03:42] eh.. that wasn't supposed to happen yet but don't worry [21:03:48] they are new hosts not in service yet [21:03:50] RoanKattouw: no worries i guess it's too late? [21:03:57] Patch is https://gerrit.wikimedia.org/r/785919 [21:04:12] PROBLEM - mediawiki-installation DSH group on mw2418 is CRITICAL: Host mw2418 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:04:15] It depends on whether anyone is doing the security window or if there's nothing to deploy there [21:04:20] ack [21:04:53] Reedy / sbassett / maryum / manfredi: Could you ping me when you're done with the security deploy window? [21:05:11] I don't think there's anything to deploy [21:05:13] RoanKattouw: I don't think any sec patches are going out today... 
[21:05:14] * Reedy checks [21:06:02] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2413 is CRITICAL: Host mw2413 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:06:02] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2414 is CRITICAL: Host mw2414 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:06:02] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2415 is CRITICAL: Host mw2415 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:06:02] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2416 is CRITICAL: Host mw2416 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:06:02] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2417 is CRITICAL: Host mw2417 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:06:03] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2418 is CRITICAL: Host mw2418 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:06:03] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2419 is CRITICAL: Host mw2419 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:06:45] Yeah, doesn't look to be anything in line to go [21:07:50] (03PS1) 10JHathaway: icinga: remove SMART check [puppet] - 10https://gerrit.wikimedia.org/r/785921 (https://phabricator.wikimedia.org/T294564) [21:08:18] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:09:09] (03CR) 10JHathaway: "Do we still need the conditional check_smart?" [puppet] - 10https://gerrit.wikimedia.org/r/785921 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [21:09:44] (03CR) 10jerkins-bot: [V: 04-1] icinga: remove SMART check [puppet] - 10https://gerrit.wikimedia.org/r/785921 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [21:13:47] (03CR) 10Dzahn: "these hosts are still set to status "Staged" in netbox. In the moment we actually pool them we need to not forget to change them to "Activ" [puppet] - 10https://gerrit.wikimedia.org/r/785918 (https://phabricator.wikimedia.org/T290192) (owner: 10Dzahn) [21:16:47] RoanKattouw: so sounds like we can? [21:17:03] Yes, sorry, got distracted. Will deploy your change now [21:17:17] (03CR) 10Catrope: [C: 03+2] [Web scroll] Restore original sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785919 (https://phabricator.wikimedia.org/T305442) (owner: 10Jdlrobson) [21:17:46] RoanKattouw: thanks a bunch you are a lifesaver! [21:17:49] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10RobH) [21:18:02] (03Merged) 10jenkins-bot: [Web scroll] Restore original sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785919 (https://phabricator.wikimedia.org/T305442) (owner: 10Jdlrobson) [21:18:58] Jdlrobson: Your patch is on mwdebug1002, please test [21:19:09] (03PS2) 10JHathaway: icinga: remove SMART check [puppet] - 10https://gerrit.wikimedia.org/r/785921 (https://phabricator.wikimedia.org/T294564) [21:19:44] RoanKattouw: on it [21:20:18] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Decom cloudcephmon200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306840 (10Andrew) [21:20:44] bingo! 
you can sync that RoanKattouw [21:23:07] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10RobH) [21:23:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:23:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:23:26] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10RobH) [21:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:12] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:785919|[Web scroll] Restore original sampling rate (T305442)]] (duration: 01m 01s) [21:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:17] T305442: QA ToC Instrumentation - https://phabricator.wikimedia.org/T305442 [21:25:44] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10cloud-services-team (Kanban): Decom cloudcephmon200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306840 (10Andrew) [21:26:15] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10cloud-services-team (Kanban): Decom cloudcephmon200[2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306840 (10Andrew) [21:26:19] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephmon[2002-2003]-dev.codfw.wmnet [21:26:22] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom cloudservices200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306669 (10Andrew) [21:28:10] Jdlrobson: It's live! [21:28:29] Thanks RoanKattouw ! Will keep an eye on the related graph [21:31:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Andrew) [21:32:42] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [21:33:02] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [21:33:31] (03PS1) 10Andrew Bogott: Remove references to cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/785922 (https://phabricator.wikimedia.org/T306843) [21:35:24] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [21:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:01] (03PS1) 10Andrew Bogott: Make cloudnet200[45].codfw.wmnet into openstack network nodes [puppet] - 10https://gerrit.wikimedia.org/r/785923 (https://phabricator.wikimedia.org/T304881) [21:38:30] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudnet200[45].codfw.wmnet into openstack network nodes [puppet] - 10https://gerrit.wikimedia.org/r/785923 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [21:38:30] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephmon[2002-2003]-dev.codfw.wmnet [21:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:50] 10SRE, 10ops-codfw, 10DC-Ops, 
10decommission-hardware, 10cloud-services-team (Kanban): Decom cloudcephmon200[2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306840 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudcephmon[2002-2003]-dev.codfw.... [21:42:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10RobH) [21:42:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10RobH) [21:43:39] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudweb2001-dev.wikimedia.org [21:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:33] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10cloud-services-team (Kanban): Decom cloudcephmon200[2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306840 (10Andrew) [21:44:59] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/785922 (https://phabricator.wikimedia.org/T306843) (owner: 10Andrew Bogott) [21:45:06] (03PS2) 10Andrew Bogott: Remove references to cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/785922 (https://phabricator.wikimedia.org/T306843) [21:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:47:45] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T306843 (10Andrew) a:05Andrew→03Papaul [21:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:49:06] jouncebot nowandnext [21:49:06] For the next 1 hour(s) and 10 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T2100) [21:49:07] In 3 hour(s) and 10 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T0100) [21:49:25] Any security stuff happening? [21:49:47] I'm going to test some scap stuff on the deploy server. [21:49:48] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [21:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:43] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudweb2001-dev.wikimedia.org [22:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:50] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T306843 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudweb2001-dev.wikimedia.org`... 
[22:01:16] dancy: Should be good [22:01:22] thx [22:04:27] !log dancy@deploy1002 Synchronized README: testing scap mods (duration: 00m 54s) [22:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:26] (03PS1) 10Gergő Tisza: [beta] Reopen beta eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785925 (https://phabricator.wikimedia.org/T306833) [22:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:40:43] jouncebot: nowandnext [22:40:43] For the next 0 hour(s) and 19 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220425T2100) [22:40:43] In 2 hour(s) and 19 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T0100) [22:45:14] (03PS6) 10Ladsgroup: TimedMediaHandler: Make videojs the only player on all group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [22:45:31] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10RobH) [22:45:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10RobH) [22:46:34] (03CR) 10Ladsgroup: [C: 03+2] "I removed group1 from disabling beta feature because we need to wait for three weeks for parsercache to expire before disabling the beta f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) (owner: 
10Jforrester) [22:47:16] (03Merged) 10jenkins-bot: TimedMediaHandler: Make videojs the only player on all group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [22:48:20] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:48:28] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10RobH) [22:49:12] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:612348|TimedMediaHandler: Make videojs the only player on all group1 (T248418)]] (duration: 00m 54s) [22:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:18] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [22:49:55] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10RobH) [22:50:52] (03PS1) 10Ladsgroup: labs: Set templatelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785947 (https://phabricator.wikimedia.org/T306673) [22:53:19] (03CR) 10Ladsgroup: [C: 03+2] labs: Set templatelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785947 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [22:53:56] (03Merged) 10jenkins-bot: labs: Set templatelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785947 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [22:54:29] rebased ^ [22:54:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:54:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:54:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:01] (03PS1) 10Ladsgroup: ActorMigration: Read from rev_actor field in all of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785948 (https://phabricator.wikimedia.org/T275246) [22:59:16] (03CR) 10Ntubotu: "hello and good day." [core] (wmf/1.23wmf20) - 10https://gerrit.wikimedia.org/r/123454 (owner: 10MaxSem) [22:59:20] (03CR) 10Ladsgroup: [C: 03+2] ActorMigration: Read from rev_actor field in all of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785948 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [22:59:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:59:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] (03Merged) 10jenkins-bot: ActorMigration: Read from rev_actor field in 
all of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785948 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [23:01:48] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:785948|ActorMigration: Read from rev_actor field in all of small wikis (T275246)]] (duration: 00m 57s) [23:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:53] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:02:17] (03CR) 10Ntubotu: Cherry-pick I550eb4b0a8fa18344e8b0de3ec85d61c2122ffb8 (031 comment) [core] (wmf/1.23wmf20) - 10https://gerrit.wikimedia.org/r/123454 (owner: 10MaxSem) [23:02:26] 10SRE, 10ops-codfw, 10DC-Ops: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 (10Dzahn) p:05Triage→03Medium [23:02:38] 10SRE: role_contacts (service owners) as a custom puppet fact - https://phabricator.wikimedia.org/T306830 (10Dzahn) p:05Triage→03Medium [23:04:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:04:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:52] (03PS1) 10Gergő Tisza: Backport video landing page changes [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785950 (https://phabricator.wikimedia.org/T303785) [23:39:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].eqiad.wmnet - https://phabricator.wikimedia.org/T306853 (10RobH) [23:41:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].eqiad.wmnet - https://phabricator.wikimedia.org/T306853 (10RobH) a:03Andrew @andrew or @nskaggs, When this info for racking was filled out by @nkaggs, it included using the FQDN of .e... [23:41:58] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Jclark-ctr) [23:42:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].eqiad.wmnet - https://phabricator.wikimedia.org/T306853 (10RobH) [23:46:34] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10RobH) [23:47:00] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10RobH) [23:47:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].eqiad.wmnet - https://phabricator.wikimedia.org/T306853 (10RobH) Please note if they do require public IP addresses, please tag in Arzhel as a subscriber so he is aware of the request. 
[23:49:32] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:45] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable