[00:05:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27566 and previous config saved to /var/cache/conftool/dbconfig/20220505-000525-ladsgroup.json [00:05:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [00:05:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [00:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:30] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [00:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27567 and previous config saved to /var/cache/conftool/dbconfig/20220505-000533-ladsgroup.json [00:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27568 and previous config saved to /var/cache/conftool/dbconfig/20220505-000907-ladsgroup.json [00:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27569 and previous config saved to /var/cache/conftool/dbconfig/20220505-001631-ladsgroup.json [00:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:36] T307525: Fix 
mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [00:24:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27570 and previous config saved to /var/cache/conftool/dbconfig/20220505-002412-ladsgroup.json [00:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27571 and previous config saved to /var/cache/conftool/dbconfig/20220505-002535-ladsgroup.json [00:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:40] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [00:31:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27572 and previous config saved to /var/cache/conftool/dbconfig/20220505-003136-ladsgroup.json [00:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27573 and previous config saved to /var/cache/conftool/dbconfig/20220505-003917-ladsgroup.json [00:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27574 and previous config saved to /var/cache/conftool/dbconfig/20220505-004040-ladsgroup.json [00:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:34] 
PROBLEM - DNS on logstash2028.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.193.1.93 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:46:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27575 and previous config saved to /var/cache/conftool/dbconfig/20220505-004641-ladsgroup.json [00:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27576 and previous config saved to /var/cache/conftool/dbconfig/20220505-005422-ladsgroup.json [00:54:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [00:54:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [00:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:27] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [00:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27577 and previous config saved to /var/cache/conftool/dbconfig/20220505-005430-ladsgroup.json [00:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27578 and previous config saved to 
/var/cache/conftool/dbconfig/20220505-005545-ladsgroup.json [00:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27579 and previous config saved to /var/cache/conftool/dbconfig/20220505-010146-ladsgroup.json [01:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:51] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [01:01:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [01:01:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [01:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27580 and previous config saved to /var/cache/conftool/dbconfig/20220505-010201-ladsgroup.json [01:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:52] (PS3) Ssingh: P:cache::varnish::frontend: mask the varnishncsa service [puppet] - https://gerrit.wikimedia.org/r/789262 [01:06:09] SRE, SRE-Access-Requests, Generated Data Platform: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (fkaelin) @jhathaway thank you for your help - see [[ https://phabricator.wikimedia.org/T267817 | T267817 ]] for my access request details....
[01:06:29] (PS28) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [01:06:49] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35094/console" [puppet] - https://gerrit.wikimedia.org/r/789262 (owner: Ssingh) [01:07:03] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [01:10:00] (PS1) Gergő Tisza: Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/789185 (https://phabricator.wikimedia.org/T306792) [01:10:22] (PS1) Gergő Tisza: Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789326 (https://phabricator.wikimedia.org/T306792) [01:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27581 and previous config saved to /var/cache/conftool/dbconfig/20220505-011050-ladsgroup.json [01:10:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [01:10:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [01:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:56] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [01:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T307525)', diff 
saved to https://phabricator.wikimedia.org/P27582 and previous config saved to /var/cache/conftool/dbconfig/20220505-011059-ladsgroup.json [01:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27583 and previous config saved to /var/cache/conftool/dbconfig/20220505-011155-ladsgroup.json [01:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:18] PROBLEM - SSH on furud.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27584 and previous config saved to /var/cache/conftool/dbconfig/20220505-012700-ladsgroup.json [01:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:55] (PS29) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [01:28:28] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [01:33:08] (CR) jerkins-bot: [V: -1] Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/789185 (https://phabricator.wikimedia.org/T306792) (owner: Gergő Tisza) [01:38:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27585 and previous config saved to /var/cache/conftool/dbconfig/20220505-013838-ladsgroup.json [01:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:43] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27586 and previous config saved to /var/cache/conftool/dbconfig/20220505-014205-ladsgroup.json [01:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:11] (PS30) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [01:44:51] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [01:50:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27587 and previous config saved to 
/var/cache/conftool/dbconfig/20220505-015343-ladsgroup.json [01:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27588 and previous config saved to /var/cache/conftool/dbconfig/20220505-015409-ladsgroup.json [01:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:13] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [01:57:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27589 and previous config saved to /var/cache/conftool/dbconfig/20220505-015712-ladsgroup.json [01:57:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:57:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27590 and previous config saved to /var/cache/conftool/dbconfig/20220505-020848-ladsgroup.json [02:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27591 and previous config saved to 
/var/cache/conftool/dbconfig/20220505-020914-ladsgroup.json [02:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:22] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:43] (PS31) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [02:17:26] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [02:17:28] RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:22:36] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:23:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27592 and previous config saved to /var/cache/conftool/dbconfig/20220505-022354-ladsgroup.json [02:23:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [02:23:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [02:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:59] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [02:24:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27593 and previous config saved to /var/cache/conftool/dbconfig/20220505-022402-ladsgroup.json [02:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27594 and previous config saved to /var/cache/conftool/dbconfig/20220505-022419-ladsgroup.json [02:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27595 and previous config saved to /var/cache/conftool/dbconfig/20220505-023633-ladsgroup.json [02:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:39] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:39:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27596 and previous config saved to /var/cache/conftool/dbconfig/20220505-023924-ladsgroup.json [02:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:42] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [02:39:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [02:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27597 and previous config saved to /var/cache/conftool/dbconfig/20220505-023948-ladsgroup.json [02:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [02:42:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [02:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db[2074,2094,2109,2127,2149].codfw.wmnet with reason: Maintenance [02:42:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[2074,2094,2109,2127,2149].codfw.wmnet with reason: Maintenance [02:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:27] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (cmooney) @Jclark-ctr Could you do me a favour and ping me before you kick off the re-image process for aqs1020/aqs1021? Juniper have come back with a response on o...
[02:51:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27598 and previous config saved to /var/cache/conftool/dbconfig/20220505-025138-ladsgroup.json [02:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:48] (PS32) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [02:53:22] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27599 and previous config saved to /var/cache/conftool/dbconfig/20220505-030644-ladsgroup.json [03:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:00] (PS33) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [03:20:41] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [03:21:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27600 and previous config saved to 
/var/cache/conftool/dbconfig/20220505-032149-ladsgroup.json [03:21:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:21:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:54] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [03:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:23:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27601 and previous config saved to /var/cache/conftool/dbconfig/20220505-032337-ladsgroup.json [03:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [03:32:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [03:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [03:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [03:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[03:36:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [03:36:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [03:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:38:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27602 and previous config saved to /var/cache/conftool/dbconfig/20220505-033842-ladsgroup.json [03:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [03:42:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [03:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:49:14] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:51:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [03:51:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance 
[03:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27603 and previous config saved to /var/cache/conftool/dbconfig/20220505-035158-ladsgroup.json [03:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:03] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [03:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27604 and previous config saved to /var/cache/conftool/dbconfig/20220505-035347-ladsgroup.json [03:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:03:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27605 and previous config saved to /var/cache/conftool/dbconfig/20220505-040344-ladsgroup.json [04:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:50] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [04:08:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27606 and previous config 
saved to /var/cache/conftool/dbconfig/20220505-040852-ladsgroup.json [04:08:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [04:08:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [04:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:58] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [04:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27607 and previous config saved to /var/cache/conftool/dbconfig/20220505-040900-ladsgroup.json [04:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27608 and previous config saved to /var/cache/conftool/dbconfig/20220505-041850-ladsgroup.json [04:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27609 and previous config saved to /var/cache/conftool/dbconfig/20220505-043354-ladsgroup.json [04:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [04:39:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [04:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27610 and previous config saved to /var/cache/conftool/dbconfig/20220505-044900-ladsgroup.json [04:49:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [04:49:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [04:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [04:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27611 and previous config saved to /var/cache/conftool/dbconfig/20220505-044908-ladsgroup.json [04:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:28] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:50:34] SRE, Data-Persistence-Backup, database-backups, observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (Ladsgroup) [04:52:22] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring 
[04:53:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27612 and previous config saved to /var/cache/conftool/dbconfig/20220505-045329-ladsgroup.json [04:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27613 and previous config saved to /var/cache/conftool/dbconfig/20220505-045450-ladsgroup.json [04:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:55] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [04:55:30] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 13.44 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [05:03:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:03:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:06] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:08:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27614 and previous config saved to /var/cache/conftool/dbconfig/20220505-050835-ladsgroup.json [05:08:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27615 and previous config saved to /var/cache/conftool/dbconfig/20220505-050955-ladsgroup.json [05:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:14] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 4.975 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [05:23:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27616 and previous config saved to /var/cache/conftool/dbconfig/20220505-052340-ladsgroup.json [05:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27618 and previous config saved to /var/cache/conftool/dbconfig/20220505-052500-ladsgroup.json [05:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [05:26:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [05:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [05:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance 
[05:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27620 and previous config saved to /var/cache/conftool/dbconfig/20220505-053845-ladsgroup.json [05:38:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [05:38:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [05:38:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:38:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:38:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27621 and previous config saved to /var/cache/conftool/dbconfig/20220505-053858-ladsgroup.json [05:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:16] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [05:39:18] (ProbeDown) firing: (10) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:39:18] (ProbeDown) firing: (20) Service appservers-https:443 has failed probes 
(http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:39] I can't load enwp [05:39:45] or it's very slow [05:40:04] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [05:40:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27622 and previous config saved to /var/cache/conftool/dbconfig/20220505-054005-ladsgroup.json [05:40:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [05:40:08] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:08] PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1136.eqiad.wmnet with reason: Maintenance [05:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:10] PROBLEM - Apache HTTP on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:10] db1127 seems to be down [05:40:10] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.07182 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [05:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27623 and previous config saved to /var/cache/conftool/dbconfig/20220505-054013-ladsgroup.json [05:40:14] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [05:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:16] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:16] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool 
db1127', diff saved to https://phabricator.wikimedia.org/P27624 and previous config saved to /var/cache/conftool/dbconfig/20220505-054027-marostegui.json [05:40:29] <_joe_> here [05:40:30] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:30] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:30] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:30] depooled db1127 [05:40:32] PROBLEM - Apache HTTP on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:33] it seems to be down [05:40:34] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:38] PROBLEM - Apache HTTP on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:40] PROBLEM - Apache HTTP on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:40] PROBLEM - Apache HTTP on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:41] <_joe_> please stop doing any schema changes now [05:40:42] PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:48] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:52] let me know if you need help, I just woke up [05:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:56] 
PROBLEM - Apache HTTP on mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:56] PROBLEM - Apache HTTP on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:58] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:40:58] PROBLEM - Apache HTTP on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:04] PROBLEM - Apache HTTP on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:04] PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:05] <_joe_> XioNoX: ack the alerts? [05:41:12] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.002445 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [05:41:14] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:15] <_joe_> I can't ssh into the servers [05:41:16] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:20] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:41:22] PROBLEM - Apache HTTP on mw1328 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:24] PROBLEM - Apache HTTP on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:30] PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:34] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [05:41:34] PROBLEM - Apache HTTP on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:36] PROBLEM - Apache HTTP on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:38] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:38] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:40] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2284 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:41:41] <_joe_> can someone ssh into the mw servers? 
[05:41:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:41:42] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:44] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:47] _joe_: nop [05:41:50] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:41:54] PROBLEM - Apache HTTP on mw1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:56] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:41:58] PROBLEM - Apache HTTP on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:00] PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:42:04] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:04] PROBLEM - Apache 
HTTP on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:04] PROBLEM - Apache HTTP on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:04] PROBLEM - Apache HTTP on mw1429 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:04] PROBLEM - Apache HTTP on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:05] PROBLEM - Apache HTTP on mw1431 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:05] PROBLEM - Apache HTTP on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:06] PROBLEM - Apache HTTP on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:06] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:07] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1433.eqiad.wmnet, mw1365.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1432.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1416.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw [05:42:07] ad.wmnet, mw1420.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1366.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, mw1395.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1385.eqiad.wmnet, mw1369.eqiad.wmnet, mw1455.eqiad.wmnet, mw1409.eqiad.wmnet, mw1436.eqiad.wmnet, 
mw1332.eqiad.wmnet, mw1452.eqiad.wmnet, mw1367.eqiad.wmnet, mw1414.eqiad.wmnet, mw1417.eqiad.wmnet, mw1371.eqiad.wmnet, mw1453.eqiad [05:42:08] mw1322.eqiad.wmnet, mw1319.eqiad.wmnet, mw1323.eqiad.wmnet, mw1454.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1364.eqiad.wmnet, mw1351.eqiad.wmnet, mw1391.eqiad.wmnet, mw135 https://wikitech.wikimedia.org/wiki/PyBal [05:42:08] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:42:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1451.eqiad.wmnet, mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1397.eqiad.wmnet, mw1420.eqiad.wmnet, mw1365.eqiad.wmnet, mw1455.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1366.eqiad.wmnet, mw1322.eqiad.wmnet, mw1432.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw [05:42:09] ad.wmnet, mw1407.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1351.eqiad.wmnet, mw1416.eqiad.wmnet, mw1391.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1352.eqiad.wmnet, mw1399.eqiad.wmnet, mw1355.eqiad.wmnet, mw1326.eqiad.wmnet, mw1435.eqiad.wmnet, mw1419.eqiad.wmnet, mw1371.eqiad.wmnet, mw1431.eqiad.wmnet, mw1333.eqiad.wmnet, mw1401.eqiad.wmnet, mw1418.eqiad.wmnet, mw1324.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad [05:42:10] mw1429.eqiad.wmnet, mw1389.eqiad.wmnet, mw1331.eqiad.wmnet, mw1319.eqiad.wmnet, mw1395.eqiad.wmnet, mw1403.eqiad.wmnet, mw1325.eqiad.wmnet, mw1409.eqiad.wmnet, mw1411.eqiad.wmnet, mw141 https://wikitech.wikimedia.org/wiki/PyBal [05:42:10] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:11] PROBLEM - MediaWiki 
exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 866 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:42:14] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9726 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:42:16] PROBLEM - Apache HTTP on mw1434 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:16] PROBLEM - Apache HTTP on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:18] PROBLEM - Apache HTTP on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:18] PROBLEM - Apache HTTP on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:20] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:30] PROBLEM - Apache HTTP on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:30] PROBLEM - Apache HTTP on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:30] PROBLEM - Apache HTTP on mw1420 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:30] PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:32] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:32] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:34] PROBLEM - Apache HTTP on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:34] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:36] all s7 is having too many connections [05:42:37] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:42:38] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:42:48] PROBLEM - Apache HTTP on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:52] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:42:56] PROBLEM - Apache HTTP on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:42:58] PROBLEM - Apache HTTP on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:00] PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- 
text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:43:00] PROBLEM - Apache HTTP on mw1432 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:02] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:06] PROBLEM - Apache HTTP on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:08] PROBLEM - Apache HTTP on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:10] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [05:43:14] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:43:15] PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:43:16] PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:16] PROBLEM - Apache 
HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:16] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:20] PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:43:22] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:43:28] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:28] PROBLEM - PHP7 rendering on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:43:30] PROBLEM - PHP7 rendering on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:43:40] PROBLEM - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:43:48] PROBLEM - Apache HTTP on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:43:54] PROBLEM - PHP7 rendering on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:43:58] PROBLEM - PHP7 rendering on mw1420 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:00] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:44:04] PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:44:05] PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:44:06] PROBLEM - PHP7 rendering on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:06] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:10] PROBLEM - PHP7 rendering on mw1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:10] PROBLEM - PHP7 rendering on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27625 and previous config saved to /var/cache/conftool/dbconfig/20220505-054411-ladsgroup.json [05:44:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:44:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:44:22] PROBLEM - PHP7 rendering on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:22] PROBLEM - PHP7 rendering on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:22] PROBLEM - PHP7 rendering on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:24] PROBLEM - PHP7 rendering on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:30] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:40] PROBLEM - PHP7 rendering on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:40] PROBLEM - PHP7 rendering on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:41] T307525: Fix mismatching field type of user table 
for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [05:44:42] PROBLEM - PHP7 rendering on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:44:56] PROBLEM - PHP7 rendering on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:00] PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:00] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:00] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:04] PROBLEM - PHP7 rendering on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:04] PROBLEM - PHP7 rendering on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:06] PROBLEM - PHP7 rendering on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:08] PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:08] PROBLEM - PHP7 rendering on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:08] PROBLEM - PHP7 rendering on mw1429 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:08] PROBLEM - PHP7 rendering on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:08] PROBLEM - PHP7 rendering on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:18] PROBLEM - PHP7 rendering on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:22] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:45:22] PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:24] PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:24] PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:24] PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:24] PROBLEM - PHP7 rendering on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:24] PROBLEM - PHP7 rendering on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:25] PROBLEM - PHP7 rendering on 
mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:30] PROBLEM - PHP7 rendering on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:30] PROBLEM - PHP7 rendering on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:30] PROBLEM - PHP7 rendering on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:30] PROBLEM - PHP7 rendering on mw1434 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:30] PROBLEM - PHP7 rendering on mw1432 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:31] PROBLEM - PHP7 rendering on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:38] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:38] PROBLEM - PHP7 rendering on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:42] PROBLEM - PHP7 rendering on mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:42] PROBLEM - PHP7 rendering on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:50] PROBLEM - PHP7 
rendering on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:54] PROBLEM - PHP7 rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:54] PROBLEM - PHP7 rendering on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:54] PROBLEM - PHP7 rendering on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:54] PROBLEM - PHP7 rendering on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:56] PROBLEM - PHP7 rendering on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:56] PROBLEM - PHP7 rendering on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:58] PROBLEM - PHP7 rendering on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:45:58] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:02] PROBLEM - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:04] PROBLEM - PHP7 rendering on mw1431 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:16] 
PROBLEM - PHP7 rendering on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:18] PROBLEM - PHP7 rendering on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:24] PROBLEM - PHP7 rendering on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:24] PROBLEM - PHP7 rendering on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:24] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:24] PROBLEM - PHP7 rendering on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:34] PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:34] PROBLEM - PHP7 rendering on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:38] PROBLEM - PHP7 rendering on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:40] PROBLEM - PHP7 rendering on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:46:50] PROBLEM - PHP7 rendering on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:47:08] PROBLEM - PHP7 rendering on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:47:26] PROBLEM - PHP7 rendering on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:48:08] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:49:08] RECOVERY - Apache HTTP on mw1409 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.974 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:49:42] RECOVERY - PHP7 rendering on mw1411 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.142 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:48] RECOVERY - PHP7 rendering on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.663 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:50] RECOVERY - Apache HTTP on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:50:52] RECOVERY - Apache HTTP on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.914 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:50:54] RECOVERY - PHP7 rendering on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.404 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:54] RECOVERY - PHP7 rendering on mw1325 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.964 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:55] <_joe_> yes it was it marostegui [05:50:56] RECOVERY - Apache HTTP on mw1325 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.575 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:50:56] RECOVERY - PHP7 rendering on mw1420 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.403 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:56] RECOVERY - PHP7 rendering on mw1409 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.635 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:56] RECOVERY - PHP7 rendering on mw1452 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.948 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:56] RECOVERY - Apache HTTP on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.980 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:50:57] RECOVERY - PHP7 rendering on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.455 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:57] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.504 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:50:58] RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.440 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:00] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.688 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:00] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.549 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [05:51:00] RECOVERY - PHP7 rendering on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.640 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:51:02] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [05:51:02] RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.849 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:02] RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19106 bytes in 4.017 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:51:03] RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19093 bytes in 4.645 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:51:03] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.017 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:04] RECOVERY - PHP7 rendering on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:04] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.663 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:04] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.247 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:04] RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.494 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:06] RECOVERY - PHP7 rendering on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:06] RECOVERY - PHP7 rendering on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:08] RECOVERY - PHP7 rendering on mw1391 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:10] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19093 bytes in 1.562 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:51:10] RECOVERY - PHP7 rendering on mw1399 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:16] RECOVERY - Apache HTTP on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:16] RECOVERY - PHP7 rendering on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:16] RECOVERY - PHP7 
rendering on mw1418 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:16] RECOVERY - PHP7 rendering on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:16] RECOVERY - Apache HTTP on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:18] RECOVERY - PHP7 rendering on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:18] RECOVERY - Apache HTTP on mw1332 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:20] RECOVERY - PHP7 rendering on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:22] RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19106 bytes in 0.467 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:51:24] RECOVERY - Apache HTTP on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:24] RECOVERY - Apache HTTP on mw1399 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:24] RECOVERY - Apache HTTP on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:24] 
RECOVERY - Apache HTTP on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:24] RECOVERY - Apache HTTP on mw1411 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:25] RECOVERY - Apache HTTP on mw1431 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:25] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:26] RECOVERY - Apache HTTP on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:26] RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:27] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19105 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:51:32] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:34] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:51:34] RECOVERY - PHP7 rendering on mw1332 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:34] RECOVERY - PHP7 rendering on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 648 
bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:34] RECOVERY - PHP7 rendering on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:36] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:51:36] RECOVERY - Apache HTTP on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:36] RECOVERY - Apache HTTP on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:36] RECOVERY - Apache HTTP on mw1434 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:36] RECOVERY - Apache HTTP on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:38] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:40] RECOVERY - PHP7 rendering on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:50] RECOVERY - Apache HTTP on mw1419 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:50] RECOVERY - PHP7 rendering on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:50] RECOVERY - Apache HTTP on mw1420 is OK: HTTP OK: 
HTTP/1.1 302 Found - 634 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:50] RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:50] RECOVERY - Apache HTTP on mw1329 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:51] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:51] RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:52] RECOVERY - Apache HTTP on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:52] RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:53] RECOVERY - Apache HTTP on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:53] RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:54] RECOVERY - PHP7 rendering on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:54] RECOVERY - PHP7 rendering on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:55] RECOVERY - Apache 
HTTP on mw1452 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:55] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:56] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:51:56] RECOVERY - PHP7 rendering on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:57] RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:51:58] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19106 bytes in 0.549 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:51:58] RECOVERY - PHP7 rendering on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:00] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5016 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [05:52:00] RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:02] RECOVERY - PHP7 rendering on 
mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:02] RECOVERY - PHP7 rendering on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:02] RECOVERY - PHP7 rendering on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:04] RECOVERY - PHP7 rendering on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:04] RECOVERY - PHP7 rendering on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:10] RECOVERY - Apache HTTP on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:14] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:14] RECOVERY - PHP7 rendering on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:14] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:14] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:16] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- 
text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19106 bytes in 1.498 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:52:16] RECOVERY - Apache HTTP on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:16] RECOVERY - PHP7 rendering on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:17] RECOVERY - Apache HTTP on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:18] RECOVERY - PHP7 rendering on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:18] RECOVERY - PHP7 rendering on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:18] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:18] RECOVERY - PHP7 rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:18] RECOVERY - PHP7 rendering on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:19] RECOVERY - PHP7 rendering on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:19] RECOVERY 
- PHP7 rendering on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:20] RECOVERY - Apache HTTP on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:20] RECOVERY - Apache HTTP on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:24] RECOVERY - Apache HTTP on mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:24] RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19093 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:52:26] RECOVERY - Apache HTTP on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:26] RECOVERY - Apache HTTP on mw1418 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:26] RECOVERY - PHP7 rendering on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:26] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:26] RECOVERY - PHP7 rendering on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:27] RECOVERY - PHP7 rendering on 
mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:27] RECOVERY - PHP7 rendering on mw1434 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:28] RECOVERY - PHP7 rendering on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:28] RECOVERY - PHP7 rendering on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:29] RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:30] RECOVERY - Apache HTTP on mw1391 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:32] RECOVERY - Apache HTTP on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:34] RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:36] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:36] RECOVERY - PHP7 rendering on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:38] RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- 
text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19106 bytes in 0.349 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:52:39] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19092 bytes in 0.498 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:52:39] RECOVERY - PHP7 rendering on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:39] RECOVERY - PHP7 rendering on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:40] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [05:52:42] RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:42] RECOVERY - Apache HTTP on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:42] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:44] RECOVERY - Apache HTTP on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:44] RECOVERY - Apache HTTP on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:46] RECOVERY - Apache HTTP on 
mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:46] RECOVERY - Apache HTTP on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:46] RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19092 bytes in 0.626 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:52:47] RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:52] RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:52] RECOVERY - PHP7 rendering on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:54] RECOVERY - Apache HTTP on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:54] RECOVERY - PHP7 rendering on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:54] RECOVERY - Apache HTTP on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:52:54] RECOVERY - PHP7 rendering on mw1324 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:54] RECOVERY - PHP7 
rendering on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:55] RECOVERY - PHP7 rendering on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:55] RECOVERY - PHP7 rendering on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:56] RECOVERY - PHP7 rendering on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:56] RECOVERY - PHP7 rendering on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:58] RECOVERY - PHP7 rendering on mw1419 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:52:58] RECOVERY - PHP7 rendering on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:53:02] RECOVERY - PHP7 rendering on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:53:04] RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:53:04] RECOVERY - PHP7 rendering on mw1431 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[05:53:04] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:53:08] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8163 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [05:53:10] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19092 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:53:14] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:53:16] RECOVERY - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 17976 bytes in 1.143 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:53:40] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:53:40] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:54:05] SRE-OnFire, DBA, Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (lmata) [05:54:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:54:12] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:54:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:54:19] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:54:22] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [05:54:32] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [05:54:34] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:54:43] <_joe_> ok, what is still alerting? [05:55:08] <_joe_> nothing AFAICS [05:55:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27627 and previous config saved to /var/cache/conftool/dbconfig/20220505-055514-ladsgroup.json [05:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:20] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [05:59:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27628 and previous config saved to /var/cache/conftool/dbconfig/20220505-055916-ladsgroup.json [05:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:00] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:00:04] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T0600). [06:04:58] SRE, DBA, GlobalBlocking, Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (Legoktm) [06:09:38] (PS34) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [06:10:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27629 and previous config saved to /var/cache/conftool/dbconfig/20220505-061019-ladsgroup.json [06:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:11:18] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:11:38] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [06:11:51] SRE, SRE-OnFire (FY2021/2022-Q4), DBA, GlobalBlocking, Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (lmata) [06:14:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to 
https://phabricator.wikimedia.org/P27630 and previous config saved to /var/cache/conftool/dbconfig/20220505-061421-ladsgroup.json [06:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [06:19:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [06:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:52] (PS35) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [06:25:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27631 and previous config saved to /var/cache/conftool/dbconfig/20220505-062524-ladsgroup.json [06:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:35] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [06:25:40] (CR) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [06:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27632 and previous config saved to /var/cache/conftool/dbconfig/20220505-062927-ladsgroup.json [06:29:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1180.eqiad.wmnet with reason: Maintenance [06:29:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [06:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:32] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [06:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27633 and previous config saved to /var/cache/conftool/dbconfig/20220505-062935-ladsgroup.json [06:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27634 and previous config saved to /var/cache/conftool/dbconfig/20220505-063347-ladsgroup.json [06:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:50] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:40:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27635 and previous config saved to /var/cache/conftool/dbconfig/20220505-064029-ladsgroup.json [06:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:40:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [06:40:34] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [06:40:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [06:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [06:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [06:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:08] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [06:46:34] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:48:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27636 and previous config saved to /var/cache/conftool/dbconfig/20220505-064852-ladsgroup.json [06:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:17] (PS36) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [06:53:32] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:54:57] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [06:58:01] (PS37) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [06:59:40] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [07:00:04] Amir1 and apergos: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T0700). [07:00:04] tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:39] no one's signed up for training, only two patches in the window [07:01:12] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:01:28] tgr or tgr_ are you around? 
[07:01:38] hi apergos, I can self-deploy [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:03] I see that your changes involve multiple files (besides the testing changes) [07:02:07] (CR) Majavah: P:openstack::puppetmaster: add 8143 to ferm rules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/788761 (owner: Majavah) [07:02:25] is there a specific order the files need to go in? because you can't count on that. [07:02:46] RECOVERY - dump of es5 in eqiad on alert1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-05-04 08:12:53 (2986 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [07:02:56] no, one of the files is a maintenance script [07:03:03] i.e. is the maintenance script something that will automatically be kicked off by... 
[07:03:10] ok I guess the answer to what I was typing is no [07:03:15] then go for it [07:03:37] (CR) Gergő Tisza: [C: +2] Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789326 (https://phabricator.wikimedia.org/T306792) (owner: Gergő Tisza) [07:03:57] (CR) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (5 comments) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [07:03:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27637 and previous config saved to /var/cache/conftool/dbconfig/20220505-070357-ladsgroup.json [07:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:04:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27638 and previous config saved to /var/cache/conftool/dbconfig/20220505-070422-ladsgroup.json [07:04:26] (CR) Gergő Tisza: [C: +2] Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/789185 (https://phabricator.wikimedia.org/T306792) (owner: Gergő Tisza) [07:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:27] T307525: Fix mismatching field type 
of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [07:07:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[1018-1019].eqiad.wmnet with reason: reboot [07:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[1018-1019].eqiad.wmnet with reason: reboot [07:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1018.eqiad.wmnet [07:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:14] (Abandoned) Muehlenhoff: ganeti.addnode: Switch bridge detection to a check based on "ip" [cookbooks] - https://gerrit.wikimedia.org/r/786895 (owner: Muehlenhoff) [07:19:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27639 and previous config saved to /var/cache/conftool/dbconfig/20220505-071904-ladsgroup.json [07:19:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:19:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:09] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [07:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27640 and previous config 
saved to /var/cache/conftool/dbconfig/20220505-071911-ladsgroup.json [07:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:19:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:18] jouncebot: now [07:20:18] For the next 0 hour(s) and 39 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T0700) [07:20:36] patience patience [07:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27641 and previous config saved to /var/cache/conftool/dbconfig/20220505-072038-ladsgroup.json [07:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:52] got something coming up? [07:21:04] yeah train at 8:00 UTC [07:21:26] but I am always fine delaying it if the backport window needs to be extended [07:21:36] it should not [07:21:43] two patches only, both config and almost merged [07:21:48] hashar: do you want to start earlier? I can force-merge the patches [07:21:55] no need [07:21:59] cause unlike real life trains, the mediawiki train can afford to arrive late :) [07:22:16] usually they take around 40 mins to merge [07:22:20] tgr: no no take your time :] [07:22:27] 40 minutes? !! 
[07:22:31] let's not forcemerge unless there's an emergency [07:22:54] they are at 18 minutes now and claim "0 minutes" left [07:23:07] I guess to be accurate they aren't mw config but growth experiment config-related [07:23:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27642 and previous config saved to /var/cache/conftool/dbconfig/20220505-072321-ladsgroup.json [07:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:29] yeah GrowthExperiments backports are sluggish [07:23:34] yeah there is a LARGE issue with the way we run tests [07:23:37] !log powercycling restbase1018 [07:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:56] (see T307180) [07:24:57] T307180: Drop Selenium tests from gate-and-submit-wmf - https://phabricator.wikimedia.org/T307180 [07:26:07] (Merged) jenkins-bot: Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789326 (https://phabricator.wikimedia.org/T306792) (owner: Gergő Tisza) [07:26:16] one merged! 
[07:26:24] 21 minutes [07:26:36] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:27:26] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:28:04] I'll probably add a config patch if the script works as intended, but those are fast [07:30:25] (CR) jerkins-bot: [V: -1] Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/789185 (https://phabricator.wikimedia.org/T306792) (owner: Gergő Tisza) [07:30:52] oh come on [07:34:39] ParserIntegrationTest::testParse with data set "parserTests.txt: Bad images - basic functionality" [07:34:44] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/GrowthExperiments: Backport: [[gerrit:789326|Community configuration: Allow writing sub-fields programmatically (T306792)]] (duration: 00m 54s) [07:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:50] T306792: initWikiConfig should set excludedSections for link-recommendation task type - https://phabricator.wikimedia.org/T306792 [07:35:03] I guess wmf.9 has a CI break, the test pipeline had the same error [07:35:14] I'll just force-merge then [07:35:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:40] (CR) Gergő Tisza: [C: +2] "Error is unrelated:" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/789185 (https://phabricator.wikimedia.org/T306792) (owner: Gergő Tisza) [07:35:43] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27643 and previous config saved to /var/cache/conftool/dbconfig/20220505-073543-ladsgroup.json [07:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:54] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] Community configuration: Allow writing sub-fields programmatically [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/789185 (https://phabricator.wikimedia.org/T306792) (owner: 10Gergő Tisza) [07:36:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:36:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27644 and previous config saved to /var/cache/conftool/dbconfig/20220505-073826-ladsgroup.json [07:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:46] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:789185|Community configuration: Allow writing sub-fields programmatically (T306792)]] (duration: 00m 52s) [07:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:02] !log running extensions/GrowthExperiments/maintenance/changeWikiConfig.php for T306792 [07:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:58] (03CR) 10Elukey: [C: 03+1] "LGTM too, the CI diff looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/788747 (https://phabricator.wikimedia.org/T301766) (owner: 10AikoChou) [07:41:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:18] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10DBA, 10GlobalBlocking, 10Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (10AlexisJazz) As the description was overwritten: it didn't break instantly and maybe it was never fully broken. For me it was extremely... [07:42:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:42:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:11] tgr how long should that maintenance script take to run do you think? [07:46:18] and is there any testing you need to do? 
[07:46:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:48:07] tgr: the ParserTestsIntegration tests are failing because mediawiki/services/parsoid @ master is injected as a dependency [07:48:30] so anything that has changed there would no longer match the parser tests' expectations in core/extensions > build fails [07:48:36] which is hmm definitely annoying [07:49:19] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] clinic-duty: add Orange support [software] - 10https://gerrit.wikimedia.org/r/788296 (owner: 10Filippo Giunchedi) [07:49:57] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] clinic-duty: Combine the various conditionals in Message#work [software] - 10https://gerrit.wikimedia.org/r/789224 (owner: 10Krinkle) [07:50:08] RECOVERY - IPMI Sensor Status on restbase1018 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [07:50:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1018.eqiad.wmnet [07:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27645 and previous config saved to /var/cache/conftool/dbconfig/20220505-075048-ladsgroup.json [07:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:52] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: port NEL alert to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:53:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27646 and previous config saved to /var/cache/conftool/dbconfig/20220505-075331-ladsgroup.json [07:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:06] apergos: hashar: I'll wrap up in 10 minutes or so, if it's okay to take a little of the train time [07:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:57:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:57:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27647 and previous config saved to /var/cache/conftool/dbconfig/20220505-075727-ladsgroup.json [07:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:31] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:00:04] hashar and brennen: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T0800). Please do the needful. [08:01:03] tgr: I have poked the Parsoid team about the parserintegration test failure. 
We definitely have a task about it somewhere though I can't find it right now :\ [08:02:34] (03PS1) 10Gergő Tisza: GrothExperiments: Enable Add Link backend on tier 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789556 (https://phabricator.wikimedia.org/T304542) [08:04:40] !log UTC morning deploys done [08:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:54] thanks hashar! I'm done. [08:05:43] tgr: awesome [08:05:51] and sorry for the Parser test brekage [08:05:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27648 and previous config saved to /var/cache/conftool/dbconfig/20220505-080553-ladsgroup.json [08:05:54] breakage [08:05:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:05:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:58] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:05:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27649 and previous config saved to /var/cache/conftool/dbconfig/20220505-080606-ladsgroup.json [08:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:25] gush we really need those bots to be made quieter [08:08:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1042.eqiad.wmnet with OS bullseye [08:08:03] 10SRE, 10MediaWiki-Special-pages, 10Chinese-Sites: Special pages got no update for more than three days on Chinese Wikipedia - https://phabricator.wikimedia.org/T307644 (10Peachey88) [08:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:05] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be1042.eqiad.wmnet with OS bullseye [08:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27650 and previous config saved to /var/cache/conftool/dbconfig/20220505-080836-ladsgroup.json [08:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:08:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T307525)', diff saved to 
https://phabricator.wikimedia.org/P27651 and previous config saved to /var/cache/conftool/dbconfig/20220505-080851-ladsgroup.json [08:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:14] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 98 probes of 669 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:12:05] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [08:12:15] logs look right [08:12:22] going to promote all wikis to 1.39.0-wmf.10 [08:13:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27652 and previous config saved to /var/cache/conftool/dbconfig/20220505-081309-ladsgroup.json [08:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:15] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:15:06] (03PS1) 10Hashar: all wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789557 [08:15:08] (03CR) 10Hashar: [C: 03+2] all wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789557 (owner: 10Hashar) [08:15:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 61 probes of 669 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:15:46] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789557 (owner: 
10Hashar) [08:17:10] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.10 refs T305216 [08:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:15] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [08:17:56] (03PS1) 10Ladsgroup: Stop writing to temp actor table in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789558 (https://phabricator.wikimedia.org/T275246) [08:18:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1042.eqiad.wmnet with reason: host reimage [08:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:19:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:29] looks good :] [08:23:43] (03PS1) 10Gehel: tlsproxy: manage ssl_ecdhe_curve internally [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) [08:24:13] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [08:24:29] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy: manage ssl_ecdhe_curve internally [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 
10Gehel) [08:25:06] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1042.eqiad.wmnet with reason: host reimage [08:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27653 and previous config saved to /var/cache/conftool/dbconfig/20220505-082510-ladsgroup.json [08:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:15] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:26:33] (03PS2) 10Gehel: tlsproxy: manage ssl_ecdhe_curve internally [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) [08:27:50] (03PS3) 10Gehel: tlsproxy: manage ssl_ecdhe_curve internally [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) [08:28:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27654 and previous config saved to /var/cache/conftool/dbconfig/20220505-082814-ladsgroup.json [08:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:28] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [08:29:59] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy: manage ssl_ecdhe_curve internally [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [08:30:07] (03PS1) 10Func: [TOC] Remove pointer-events:none on .sidebar-toc-link [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789327 (https://phabricator.wikimedia.org/T307271) [08:30:48] (03PS1) 
10Btullis: Temporarily disable gobblin [puppet] - 10https://gerrit.wikimedia.org/r/789560 (https://phabricator.wikimedia.org/T304938) [08:31:01] (03PS1) 10Vgutierrez: cache::text_haproxy: Add missing parsoid-rt-tests.wm.o to alternate domains [puppet] - 10https://gerrit.wikimedia.org/r/789561 (https://phabricator.wikimedia.org/T266509) [08:32:36] (03CR) 10Gehel: [C: 04-1] "Note that the Swift frontends seem to use a different logic to disable ecdhe curves. This change would enable ecdhe for Swift, which is pr" [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [08:33:10] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35096/console" [puppet] - 10https://gerrit.wikimedia.org/r/789560 (https://phabricator.wikimedia.org/T304938) (owner: 10Btullis) [08:33:42] (03PS4) 10Gehel: tlsproxy: manage ssl_ecdhe_curve internally [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) [08:35:48] (03CR) 10Vgutierrez: [C: 03+2] cache::text_haproxy: Add missing parsoid-rt-tests.wm.o to alternate domains [puppet] - 10https://gerrit.wikimedia.org/r/789561 (https://phabricator.wikimedia.org/T266509) (owner: 10Vgutierrez) [08:37:15] (03PS1) 10Ladsgroup: Set cebwiki to read new in templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789562 (https://phabricator.wikimedia.org/T306673) [08:38:31] (03CR) 10Btullis: [V: 03+1 C: 03+2] Temporarily disable gobblin [puppet] - 10https://gerrit.wikimedia.org/r/789560 (https://phabricator.wikimedia.org/T304938) (owner: 10Btullis) [08:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27655 and previous config saved to /var/cache/conftool/dbconfig/20220505-084015-ladsgroup.json [08:40:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:59] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:43:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27656 and previous config saved to /var/cache/conftool/dbconfig/20220505-084320-ladsgroup.json [08:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:47] PROBLEM - Check for large files in client bucket on an-launcher1002 is CRITICAL: WARNING: large files in client bucket https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [08:47:02] 10SRE, 10Traffic, 10Patch-For-Review, 10Upstream: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (10Vgutierrez) Issue fixed by upstream on https://git.haproxy.org/?p=haproxy-2.4.git;a=commit;h=f9a0f51d3bfa37993935754508e7c88b2e69c9ed [08:48:00] (03PS4) 10Jaime Nuche: scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) [08:49:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27657 and previous config saved to /var/cache/conftool/dbconfig/20220505-084917-ladsgroup.json [08:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:22] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:49:31] tgr: I have created a `wmf/1.39.0-wmf.10` branch on the `mediawiki/services/parsoid` repo which should fix the ParserIntegrationTest failures which were blocking your backports this morning [08:49:41] a more robust / automatic solution will 
have to be found though [08:51:18] (03CR) 10Jaime Nuche: scap: add system package requirements for scap (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [08:52:25] thanks hashar! though now that everything is on wmf.10, a wmf.9 CI break wouldn't matter much anyway [08:53:30] but some sort of automatic branching of that repo would be nice [08:53:35] (03CR) 10Muehlenhoff: [C: 03+2] Remove webperf1001/webperf2001 from Kafka Ferm config [puppet] - 10https://gerrit.wikimedia.org/r/789084 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [08:54:21] I guess this applies to other MediaWiki libraries as well, they just change much less often [08:55:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27658 and previous config saved to /var/cache/conftool/dbconfig/20220505-085520-ladsgroup.json [08:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (in case it gets confirmed that no shell access is needed)" [puppet] - 10https://gerrit.wikimedia.org/r/789288 (https://phabricator.wikimedia.org/T307582) (owner: 10JHathaway) [08:57:22] tgr|away: I think we did a tag based automatic branching for CentralAuth [08:57:28] we will see :] [08:58:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27659 and previous config saved to /var/cache/conftool/dbconfig/20220505-085825-ladsgroup.json [08:58:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [08:58:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [08:58:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:30] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27660 and previous config saved to /var/cache/conftool/dbconfig/20220505-085833-ladsgroup.json [08:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:07] I'm about to restart an-coord1001 - I have tried to silence things in advance, but there may be some alert noise. Apologies in advance. [08:59:24] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1042.eqiad.wmnet with OS bullseye [08:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:29] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be1042.eqiad.wmnet with OS bullseye completed: - ms-be1042 (**PASS**) - Downtim... [09:00:20] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1001.eqiad.wmnet [09:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [09:04:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P27661 and previous config saved to /var/cache/conftool/dbconfig/20220505-090422-ladsgroup.json [09:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1001.eqiad.wmnet [09:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:27] (03PS1) 10Btullis: Re-enable gobblin jobs [puppet] - 10https://gerrit.wikimedia.org/r/789567 (https://phabricator.wikimedia.org/T304938) [09:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27662 and previous config saved to /var/cache/conftool/dbconfig/20220505-091025-ladsgroup.json [09:10:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:10:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:31] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [09:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27663 and previous config saved to /var/cache/conftool/dbconfig/20220505-091033-ladsgroup.json [09:10:35] (03CR) 10Muehlenhoff: elastic: enable/disable ssl_ecdhe_curve 
based on OS version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [09:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27664 and previous config saved to /var/cache/conftool/dbconfig/20220505-091348-ladsgroup.json [09:13:49] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [09:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1019.eqiad.wmnet with reason: reboot [09:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1019.eqiad.wmnet with reason: reboot [09:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:52] (03CR) 10Btullis: [C: 03+2] Re-enable gobblin jobs [puppet] - 10https://gerrit.wikimedia.org/r/789567 (https://phabricator.wikimedia.org/T304938) (owner: 10Btullis) [09:16:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1019.eqiad.wmnet [09:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P27665 and previous config saved to /var/cache/conftool/dbconfig/20220505-091927-ladsgroup.json [09:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:12] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1019.eqiad.wmnet [09:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27666 and previous config saved to /var/cache/conftool/dbconfig/20220505-092651-ladsgroup.json [09:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:56] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [09:28:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27667 and previous config saved to /var/cache/conftool/dbconfig/20220505-092853-ladsgroup.json [09:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[1020-1023].eqiad.wmnet with reason: reboot [09:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[1020-1023].eqiad.wmnet with reason: reboot [09:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:20] (03PS4) 10Vgutierrez: mtail::cache_haproxy: Track termination state per request [puppet] - 10https://gerrit.wikimedia.org/r/789108 (https://phabricator.wikimedia.org/T307444) [09:34:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27668 and previous config saved to /var/cache/conftool/dbconfig/20220505-093432-ladsgroup.json [09:34:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
dbstore1003.eqiad.wmnet with reason: Maintenance [09:34:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [09:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:37] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [09:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After the incident', diff saved to https://phabricator.wikimedia.org/P27669 and previous config saved to /var/cache/conftool/dbconfig/20220505-093543-root.json [09:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1020.eqiad.wmnet [09:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:20] (03PS1) 10Slyngshede: Bug: T123456 - Update statistics::rsync::published to use SystemD timers [puppet] - 10https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) [09:40:53] (03CR) 10jerkins-bot: [V: 04-1] Bug: T123456 - Update statistics::rsync::published to use SystemD timers [puppet] - 10https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: 10Slyngshede) [09:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27670 and previous config saved to /var/cache/conftool/dbconfig/20220505-094156-ladsgroup.json [09:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27671 and previous config saved to /var/cache/conftool/dbconfig/20220505-094358-ladsgroup.json [09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1020.eqiad.wmnet [09:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:22] (CR) Vgutierrez: [C: +2] mtail::cache_haproxy: Track termination state per request [puppet] - https://gerrit.wikimedia.org/r/789108 (https://phabricator.wikimedia.org/T307444) (owner: Vgutierrez) [09:49:20] (PS2) Slyngshede: Update statistics::rsync::published to use SystemD timers [puppet] - https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) [09:50:03] (PS6) Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [09:50:17] (CR) Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [09:50:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1021.eqiad.wmnet [09:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After the incident', diff saved to https://phabricator.wikimedia.org/P27672 and previous config saved to /var/cache/conftool/dbconfig/20220505-095049-root.json [09:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:50] (PS1) Ayounsi: Update Makefile for Bullseye support [software/netbox-deploy] - https://gerrit.wikimedia.org/r/789571 [09:51:52] (PS1) Ayounsi: Update requirements and artifacts for bullseye [software/netbox-deploy] - https://gerrit.wikimedia.org/r/789572 [09:53:47] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [09:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:43] (PS2) Ayounsi: Update Makefile for Bullseye support [software/netbox-deploy] - https://gerrit.wikimedia.org/r/789571 [09:54:45] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:54:45] (PS2) Ayounsi: Update requirements and artifacts for bullseye [software/netbox-deploy] - https://gerrit.wikimedia.org/r/789572 [09:56:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1021.eqiad.wmnet [09:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27673 and previous config saved to /var/cache/conftool/dbconfig/20220505-095701-ladsgroup.json [09:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:08] (CR) Ayounsi: "There is a (non blocking) bug I can't figure out."
[software/netbox-deploy] - https://gerrit.wikimedia.org/r/789571 (owner: Ayounsi) [09:59:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27674 and previous config saved to /var/cache/conftool/dbconfig/20220505-095903-ladsgroup.json [09:59:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [09:59:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [09:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:08] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [09:59:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27675 and previous config saved to /var/cache/conftool/dbconfig/20220505-095917-ladsgroup.json [09:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:04] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T1000). [10:00:29] SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) [10:00:43] SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) [10:00:45] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) [10:03:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27676 and previous config saved to /var/cache/conftool/dbconfig/20220505-100329-ladsgroup.json [10:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:48] (CR) Mvolz: [C: +2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/788705 (owner: PipelineBot) [10:05:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: After the incident', diff saved to https://phabricator.wikimedia.org/P27677 and previous config saved to /var/cache/conftool/dbconfig/20220505-100553-root.json [10:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:23] RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-05-03 14:34:47 (3007 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:08:39] (CR) Giuseppe Lavagetto: requestctl: set an X-Requestctl header for matching rules (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/787437 (https://phabricator.wikimedia.org/T305582) (owner: Giuseppe Lavagetto) [10:09:22] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/788705 (owner: PipelineBot) [10:09:34] !log
mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2049.codfw.wmnet with OS bullseye [10:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:38] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2049.codfw.wmnet with OS bullseye [10:10:04] (PS1) Jbond: wmflib: add drmrs site [puppet] - https://gerrit.wikimedia.org/r/789574 [10:11:13] SRE-OnFire, DBA, Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) Checking the optimizer output for this query: ` SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGl... [10:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27678 and previous config saved to /var/cache/conftool/dbconfig/20220505-101206-ladsgroup.json [10:12:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:12:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:11] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [10:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P27679 and previous config saved to
/var/cache/conftool/dbconfig/20220505-101214-ladsgroup.json [10:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:13:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27680 and previous config saved to /var/cache/conftool/dbconfig/20220505-101400-ladsgroup.json [10:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:19] (CR) Giuseppe Lavagetto: requestctl: Allow detecting matching rules that are disabled (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/787438 (https://phabricator.wikimedia.org/T305582) (owner: Giuseppe Lavagetto) [10:17:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1022.eqiad.wmnet [10:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27681 and previous config saved to /var/cache/conftool/dbconfig/20220505-101835-ladsgroup.json [10:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:28] (PS1) Filippo Giunchedi: sre: add varnish/haproxy availability pages [alerts] - https://gerrit.wikimedia.org/r/789575 (https://phabricator.wikimedia.org/T305847) [10:20:57] !log marostegui@cumin1001 dbctl
commit (dc=all): 'db1127 (re)pooling @ 50%: After the incident', diff saved to https://phabricator.wikimedia.org/P27682 and previous config saved to /var/cache/conftool/dbconfig/20220505-102056-root.json [10:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:45] (CR) Filippo Giunchedi: "I couldn't find logs and runbooks for haproxy, hence the TODOs there. Let me know what you think!" [alerts] - https://gerrit.wikimedia.org/r/789575 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [10:23:34] SRE, SRE-OnFire (FY2021/2022-Q4), DBA, GlobalBlocking, Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (Marostegui) Updating this task - cross posting from the original schema change task: This query seemed to be the one that got stuck ` S... [10:23:39] SRE, SRE-OnFire (FY2021/2022-Q4), DBA, GlobalBlocking, Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (Marostegui) [10:23:43] SRE-OnFire, DBA, Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) [10:24:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1022.eqiad.wmnet [10:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:08] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2049.codfw.wmnet with reason: host reimage [10:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:57] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[10:26:58] (PS2) Roman Stolar: Migrate tests from nose to pytest [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/789163 (https://phabricator.wikimedia.org/T303866) [10:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:40] (CR) Roman Stolar: Migrate tests from nose to pytest (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/789163 (https://phabricator.wikimedia.org/T303866) (owner: Roman Stolar) [10:28:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1023.eqiad.wmnet [10:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P27683 and previous config saved to /var/cache/conftool/dbconfig/20220505-102817-marostegui.json [10:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:26] SRE, SRE-OnFire (FY2021/2022-Q4), DBA, GlobalBlocking, Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (jcrespo) @AlexisJazz Full doc will come later, but for clarification, the impact was the following: * Cached requests (anonymous users...
[10:29:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2049.codfw.wmnet with reason: host reimage [10:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:10] !log Alter globalblocks on db1127 T307501 [10:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:14] T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 [10:31:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1127 with low weight', diff saved to https://phabricator.wikimedia.org/P27684 and previous config saved to /var/cache/conftool/dbconfig/20220505-103111-marostegui.json [10:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:39] SRE-OnFire, DBA, Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) Altered db1127 and pooled it with very low weight to see if there's some slowness there [10:33:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27685 and previous config saved to /var/cache/conftool/dbconfig/20220505-103340-ladsgroup.json [10:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27686 and previous config saved to /var/cache/conftool/dbconfig/20220505-103419-ladsgroup.json [10:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:23] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [10:34:46] !log
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1023.eqiad.wmnet [10:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:10] (PS2) Jbond: wmflib: add drmrs site [puppet] - https://gerrit.wikimedia.org/r/789574 [10:36:12] (PS1) Jbond: Wmflib: add new function mapped to URI.decode_www_form [puppet] - https://gerrit.wikimedia.org/r/789579 [10:36:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [10:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[1024-1026].eqiad.wmnet with reason: reboot [10:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[1024-1026].eqiad.wmnet with reason: reboot [10:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1127 with low weight', diff saved to https://phabricator.wikimedia.org/P27687 and previous config saved to /var/cache/conftool/dbconfig/20220505-103723-marostegui.json [10:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:52] (CR) jerkins-bot: [V: -1] Wmflib: add new function mapped to URI.decode_www_form [puppet] - https://gerrit.wikimedia.org/r/789579 (owner: Jbond) [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:39:42] (PS2) Jbond: Wmflib: add new function mapped to URI.decode_www_form [puppet] - https://gerrit.wikimedia.org/r/789579
[10:39:57] (CR) Jbond: [C: +2] wmflib: add drmrs site [puppet] - https://gerrit.wikimedia.org/r/789574 (owner: Jbond) [10:41:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1024.eqiad.wmnet [10:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:49] (CR) Jbond: [C: +2] Wmflib: add new function mapped to URI.decode_www_form [puppet] - https://gerrit.wikimedia.org/r/789579 (owner: Jbond) [10:42:56] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [10:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:16] (CR) Jbond: "updated and ready for a new review" [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [10:44:36] (CR) jerkins-bot: [V: -1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [10:44:43] (PS8) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 [10:45:03] (PS9) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 [10:45:13] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2049.codfw.wmnet with OS bullseye [10:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:17] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2049.codfw.wmnet with OS bullseye completed: - ms-be2049 (**PASS**) - Downtim...
[10:45:47] (CR) jerkins-bot: [V: -1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [10:46:49] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) [10:47:14] SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) p:Triage→High [10:48:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1024.eqiad.wmnet [10:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:37] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) Broadly, there should be little impact (and our monitoring suggests error rates within expected ranges); I hope any errors were infrequent and transient. Unfortuna...
[10:48:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27688 and previous config saved to /var/cache/conftool/dbconfig/20220505-104845-ladsgroup.json [10:48:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:48:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:49] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [10:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27689 and previous config saved to /var/cache/conftool/dbconfig/20220505-104853-ladsgroup.json [10:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P27690 and previous config saved to /var/cache/conftool/dbconfig/20220505-104924-ladsgroup.json [10:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:26] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) eqiad blocked on T307667 ms-be1059 being broken; that's unrelated to the reimages (it's still on stretch), but still blocks us as it's only safe to have one host ou...
[10:50:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase db1127 weight', diff saved to https://phabricator.wikimedia.org/P27691 and previous config saved to /var/cache/conftool/dbconfig/20220505-105049-marostegui.json [10:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27692 and previous config saved to /var/cache/conftool/dbconfig/20220505-105316-ladsgroup.json [10:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1025.eqiad.wmnet [10:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet [10:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:55] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for Ariel Gutman - https://phabricator.wikimedia.org/T307582 (AGutman-WMF) @jhathaway to be honest, I'm not sure. Maybe @cmassaro would know.
[10:57:30] (PS2) Giuseppe Lavagetto: requestctl: set an X-Requestctl header for matching rules [software/conftool] - https://gerrit.wikimedia.org/r/787437 (https://phabricator.wikimedia.org/T305582) [10:57:32] (PS2) Giuseppe Lavagetto: requestctl: Allow detecting matching rules that are disabled [software/conftool] - https://gerrit.wikimedia.org/r/787438 (https://phabricator.wikimedia.org/T305582) [10:57:34] (PS2) Giuseppe Lavagetto: reqestctl: add unit tests for grammar parsing [software/conftool] - https://gerrit.wikimedia.org/r/789153 (https://phabricator.wikimedia.org/T305607) [10:57:36] (PS3) Giuseppe Lavagetto: requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) [10:59:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet [10:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [11:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1025.eqiad.wmnet [11:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:39] (PS1) Marostegui: db2122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/789581 [11:04:20] (CR) Marostegui: [C: +2] db2122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/789581 (owner: Marostegui) [11:04:26] !log jmm@cumin2002
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [11:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P27693 and previous config saved to /var/cache/conftool/dbconfig/20220505-110429-ladsgroup.json [11:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet [11:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27694 and previous config saved to /var/cache/conftool/dbconfig/20220505-110821-ladsgroup.json [11:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet [11:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:15] (CR) Muehlenhoff: Update statistics::rsync::published to use SystemD timers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede) [11:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase db1127 weight', diff saved to https://phabricator.wikimedia.org/P27695 and previous config saved to /var/cache/conftool/dbconfig/20220505-110940-marostegui.json [11:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P27696 and previous config saved to
/var/cache/conftool/dbconfig/20220505-111228-ladsgroup.json [11:12:30] (PS3) Slyngshede: Update statistics::rsync::published to use SystemD timers [puppet] - https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) [11:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:33] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [11:18:05] (PS1) Majavah: toolsdb: add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/789588 (https://phabricator.wikimedia.org/T301993) [11:19:34] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35099/console" [puppet] - https://gerrit.wikimedia.org/r/789588 (https://phabricator.wikimedia.org/T301993) (owner: Majavah) [11:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27697 and previous config saved to /var/cache/conftool/dbconfig/20220505-111934-ladsgroup.json [11:19:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:19:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:19:40] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [11:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:43] !log
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27698 and previous config saved to /var/cache/conftool/dbconfig/20220505-111947-ladsgroup.json [11:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:21] (PS1) Muehlenhoff: Enable repo sync for node14 [puppet] - https://gerrit.wikimedia.org/r/789592 (https://phabricator.wikimedia.org/T306996) [11:21:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1026.eqiad.wmnet [11:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:10] (PS1) Ayounsi: Update requirements and artifacts for bullseye [software/netbox-deploy] (2-10-4-bullseye) - https://gerrit.wikimedia.org/r/789596 (https://phabricator.wikimedia.org/T296452) [11:23:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27699 and previous config saved to /var/cache/conftool/dbconfig/20220505-112326-ladsgroup.json [11:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host gerrit2002.wikimedia.org [11:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:01] (Abandoned) Ayounsi: Update requirements and artifacts for bullseye [software/netbox-deploy] - https://gerrit.wikimedia.org/r/789572 (owner: Ayounsi) [11:27:34]
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P27700 and previous config saved to /var/cache/conftool/dbconfig/20220505-112733-ladsgroup.json [11:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1026.eqiad.wmnet [11:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2002.wikimedia.org [11:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase db1127 weight', diff saved to https://phabricator.wikimedia.org/P27701 and previous config saved to /var/cache/conftool/dbconfig/20220505-113006-marostegui.json [11:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] (03PS1) 10Cathal Mooney: Minor fixes to ASW EVPN templates [homer/public] - 10https://gerrit.wikimedia.org/r/789597 (https://phabricator.wikimedia.org/T299758) [11:33:54] (03CR) 10Filippo Giunchedi: "Thanks for the heads up, ATM I don't have the bandwidth to meaningfully vote on this (bug LGTM overall, practically all swift frontends ar" [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [11:34:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27702 and previous config saved to /var/cache/conftool/dbconfig/20220505-113412-ladsgroup.json [11:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:17] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T307525 [11:35:41] (CR) Marostegui: [C: +1] toolsdb: add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/789588 (https://phabricator.wikimedia.org/T301993) (owner: Majavah) [11:35:54] (PS1) Marostegui: Revert "db2122: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/789331 [11:36:42] (CR) Marostegui: [C: +2] Revert "db2122: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/789331 (owner: Marostegui) [11:37:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1127', diff saved to https://phabricator.wikimedia.org/P27703 and previous config saved to /var/cache/conftool/dbconfig/20220505-113711-marostegui.json [11:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:43] SRE-OnFire, DBA, Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) db1127 is now fully repooled, and I am not seeing any locks or high latency for those queries (or any). Still checki... 
[11:38:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27704 and previous config saved to /var/cache/conftool/dbconfig/20220505-113831-ladsgroup.json [11:38:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [11:38:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [11:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27705 and previous config saved to /var/cache/conftool/dbconfig/20220505-113839-ladsgroup.json [11:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[1027-1029].eqiad.wmnet with reason: reboot [11:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[1027-1029].eqiad.wmnet with reason: reboot [11:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:33] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [11:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:37] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
mvernon@cumin1001 for host ms-be2050.codfw.wmnet with OS bullseye [11:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P27706 and previous config saved to /var/cache/conftool/dbconfig/20220505-114238-ladsgroup.json [11:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:14] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) As another test I have altered db1101:3317 without depooling it (to sort of simulate what happened earlier today whe... [11:44:05] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10DBA, 10GlobalBlocking, 10Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (10Marostegui) A most recent test on one of the most affected hosts during the outage, does show a different query plan: https://phabrica... [11:44:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [11:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1027.eqiad.wmnet [11:44:31] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) Reverted the schema change on db1127 too (as the query plan changes there as well). 
[11:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [11:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:47:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P27707 and previous config saved to /var/cache/conftool/dbconfig/20220505-114712-marostegui.json [11:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet [11:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27708 and previous config saved to /var/cache/conftool/dbconfig/20220505-114917-ladsgroup.json [11:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:00] (03CR) 10MVernon: [C: 04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [11:50:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1004.eqiad.wmnet [11:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1027.eqiad.wmnet [11:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:53:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [11:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:56:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1028.eqiad.wmnet [11:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P27709 and previous config saved to /var/cache/conftool/dbconfig/20220505-115743-ladsgroup.json [11:57:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:57:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:48] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [11:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P27710 and previous config saved to /var/cache/conftool/dbconfig/20220505-115751-ladsgroup.json [11:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:01] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [11:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27711 and previous config saved to /var/cache/conftool/dbconfig/20220505-115844-ladsgroup.json [11:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet [11:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After the incident', diff saved to https://phabricator.wikimedia.org/P27712 and previous config saved to /var/cache/conftool/dbconfig/20220505-115901-root.json [11:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:06] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [11:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet [12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:51] 10SRE, 10SRE-Access-Requests, 10Generated Data Platform: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (10WDoranWMF) [12:02:05] 10SRE, 10SRE-Access-Requests, 10Generated Data Platform: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (10WDoranWMF) @jhathaway done, thanks [12:02:31] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
ms-be2050.codfw.wmnet with reason: host reimage [12:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1028.eqiad.wmnet [12:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27713 and previous config saved to /var/cache/conftool/dbconfig/20220505-120422-ladsgroup.json [12:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:36] SRE-OnFire, DBA, Blocked-on-schema-change: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) This looks like a classic optimizer bug, when sometimes it picks the right index and sometimes it doesn't: Altering... [12:07:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1029.eqiad.wmnet [12:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:12] (PS1) Slyngshede: Update statistics::publishd to use SystemD timers, rather than cron. [puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) [12:10:46] (CR) jerkins-bot: [V: -1] Update statistics::publishd to use SystemD timers, rather than cron. 
[puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [12:11:26] PROBLEM - SSH on ms-be1059.mgmt is CRITICAL: connect to address 10.65.5.18 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:13:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27714 and previous config saved to /var/cache/conftool/dbconfig/20220505-121349-ladsgroup.json [12:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After the incident', diff saved to https://phabricator.wikimedia.org/P27715 and previous config saved to /var/cache/conftool/dbconfig/20220505-121405-root.json [12:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:36] (PS2) Slyngshede: Update statistics::publishd to use SystemD timers, rather than cron. [puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) [12:15:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1029.eqiad.wmnet [12:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:19] (CR) jerkins-bot: [V: -1] Update statistics::publishd to use SystemD timers, rather than cron. 
[puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [12:15:51] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[1030-1032].eqiad.wmnet with reason: reboot [12:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[1030-1032].eqiad.wmnet with reason: reboot [12:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:22] (PS3) Slyngshede: Update statistics::publishd to use SystemD timers, rather than cron. [puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) [12:17:05] (CR) jerkins-bot: [V: -1] Update statistics::publishd to use SystemD timers, rather than cron. [puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [12:17:39] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [12:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:15] (PS4) Slyngshede: Update statistics::publishd to use SystemD timers, rather than cron. 
[puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) [12:19:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27716 and previous config saved to /var/cache/conftool/dbconfig/20220505-121928-ladsgroup.json [12:19:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:19:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:19:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27717 and previous config saved to /var/cache/conftool/dbconfig/20220505-121935-ladsgroup.json [12:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:38] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [12:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet [12:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:32] (PS5) Gehel: tlsproxy: manage ssl_ecdhe_curve internally [puppet] - https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) [12:20:45] (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: Gehel) [12:22:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - 
Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:25:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [12:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:24] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2050.codfw.wmnet with OS bullseye [12:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:28] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2050.codfw.wmnet with OS bullseye completed: - ms-be2050 (**WARN**) - Downtim... [12:27:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1030.eqiad.wmnet [12:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:09] !log Regular analytics weekly train [analytics/refinery@cc4b2bd] [12:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:20] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:20] (03PS3) 10Jbond: Update Makefile for Bullseye support [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/789571 (owner: 10Ayounsi) [12:28:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27718 and previous config saved to /var/cache/conftool/dbconfig/20220505-122854-ladsgroup.json [12:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: After the incident', diff 
saved to https://phabricator.wikimedia.org/P27719 and previous config saved to /var/cache/conftool/dbconfig/20220505-122909-root.json [12:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1031.eqiad.wmnet [12:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:33] (03PS4) 10Jbond: Update Makefile for Bullseye support [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/789571 (owner: 10Ayounsi) [12:31:44] (03CR) 10Gehel: tlsproxy: manage ssl_ecdhe_curve internally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [12:31:45] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [12:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:10] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [12:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:26] (03CR) 10Gehel: tlsproxy: manage ssl_ecdhe_curve internally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [12:34:17] (03CR) 10Jbond: [C: 03+2] ci: on castor server drop /srv requirement [puppet] - 10https://gerrit.wikimedia.org/r/774525 (https://phabricator.wikimedia.org/T252071) (owner: 10Hashar) [12:34:53] (03CR) 10Jbond: [C: 03+2] ci: relocate castor storage directory [puppet] - 10https://gerrit.wikimedia.org/r/774771 (https://phabricator.wikimedia.org/T252071) (owner: 10Hashar) [12:36:02] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [12:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:33] !log aqu@deploy1002 Started deploy 
[analytics/refinery@6b9b65d]: Regular analytics weekly train [analytics/refinery@6b9b65d] [12:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:40] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10DBA, 10GlobalBlocking, 10Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (10akosiaris) Just for greater visibility and awareness, there is T301505 for the `upstream connect error or disconnect/reset before heade... [12:38:44] (03PS4) 10EllenR: Set log level to 'debug' for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) [12:39:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase1031.eqiad.wmnet [12:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:34] 10ops-eqiad: restbase1031 NIC with limited connection speed after reboot - https://phabricator.wikimedia.org/T307677 (10MoritzMuehlenhoff) [12:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27720 and previous config saved to /var/cache/conftool/dbconfig/20220505-124401-ladsgroup.json [12:44:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:44:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:09] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [12:44:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After the incident', diff saved to 
https://phabricator.wikimedia.org/P27721 and previous config saved to /var/cache/conftool/dbconfig/20220505-124413-root.json [12:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1032.eqiad.wmnet [12:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:11] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [12:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1032.eqiad.wmnet [12:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:20] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org [12:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1033.eqiad.wmnet [12:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:34] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1001.eqiad.wmnet [12:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:54] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) [12:53:06] (03CR) 10David Caro: [C: 03+1] "Yay! 
\o/" [puppet] - 10https://gerrit.wikimedia.org/r/789588 (https://phabricator.wikimedia.org/T301993) (owner: 10Majavah) [12:53:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [12:53:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [12:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [12:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [12:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1033.eqiad.wmnet [12:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P27722 and previous config saved to /var/cache/conftool/dbconfig/20220505-125806-ladsgroup.json [12:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:11] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [12:58:52] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:59:17] (03CR) 10Hnowlan: [C: 03+2] Migrate tests from nose to pytest 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/789163 (https://phabricator.wikimedia.org/T303866) (owner: 10Roman Stolar) [12:59:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After the incident', diff saved to https://phabricator.wikimedia.org/P27723 and previous config saved to /var/cache/conftool/dbconfig/20220505-125917-root.json [12:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T1300). [13:00:05] tgr and Func: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] (03Merged) 10jenkins-bot: Migrate tests from nose to pytest [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/789163 (https://phabricator.wikimedia.org/T303866) (owner: 10Roman Stolar) [13:01:16] tgr_ / tgr: I assume you’ll self-service? [13:01:18] I can deploy [13:01:21] ok [13:02:10] Func: around? 
[13:02:16] yep [13:02:22] (CR) Gergő Tisza: [C: +2] [TOC] Remove pointer-events:none on .sidebar-toc-link [skins/Vector] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789327 (https://phabricator.wikimedia.org/T307271) (owner: Func) [13:02:47] (CR) Gergő Tisza: [C: +2] GrothExperiments: Enable Add Link backend on tier 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/789556 (https://phabricator.wikimedia.org/T304542) (owner: Gergő Tisza) [13:03:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27724 and previous config saved to /var/cache/conftool/dbconfig/20220505-130313-ladsgroup.json [13:03:15] (CR) Samtar: "My (uninformed) "test plan" is at https://phabricator.wikimedia.org/T274359#7751644 fwiw" [deployment-charts] - https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) (owner: Samtar) [13:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:18] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:03:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:03:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:36] (Merged) jenkins-bot: GrothExperiments: Enable Add Link backend on tier 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/789556 (https://phabricator.wikimedia.org/T304542) (owner: Gergő Tisza) [13:06:32] !log aqu@deploy1002 Finished deploy [analytics/refinery@6b9b65d]: Regular 
analytics weekly train [analytics/refinery@6b9b65d] (duration: 29m 59s) [13:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:53] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789556|GrothExperiments: Enable Add Link backend on tier 3 wikis (T304542)]] (duration: 00m 49s) [13:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:58] T304542: Deploy "add a link" to third round of wikis - https://phabricator.wikimedia.org/T304542 [13:07:44] !log aqu@deploy1002 Started deploy [analytics/refinery@6b9b65d] (thin): Regular analytics weekly train THIN [analytics/refinery@6b9b65d] [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:52] !log aqu@deploy1002 Finished deploy [analytics/refinery@6b9b65d] (thin): Regular analytics weekly train THIN [analytics/refinery@6b9b65d] (duration: 00m 08s) [13:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:11] !log aqu@deploy1002 Started deploy [analytics/refinery@6b9b65d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6b9b65d] [13:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:34] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet [13:08:36] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ml-cache1001.eqiad.wmnet [13:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:42] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:36] (CR) EllenR: "did rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: EllenR) [13:12:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:12:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27725 and previous config saved to /var/cache/conftool/dbconfig/20220505-131253-ladsgroup.json [13:12:54] (PS5) EllenR: Set log level to 'debug' for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) [13:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:58] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:13:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27726 and previous config saved to /var/cache/conftool/dbconfig/20220505-131311-ladsgroup.json [13:13:14] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After the incident', diff saved to https://phabricator.wikimedia.org/P27727 and previous config saved to /var/cache/conftool/dbconfig/20220505-131421-root.json [13:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:11] !log aqu@deploy1002 Finished deploy [analytics/refinery@6b9b65d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6b9b65d] (duration: 07m 00s) [13:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:56] (03CR) 10Filippo Giunchedi: "Two comments inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/789592 (https://phabricator.wikimedia.org/T306996) (owner: 10Muehlenhoff) [13:16:39] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2051.codfw.wmnet with OS bullseye [13:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:44] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2051.codfw.wmnet with OS bullseye [13:17:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [13:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27728 and previous config saved to /var/cache/conftool/dbconfig/20220505-131818-ladsgroup.json [13:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:47] (03Merged) 10jenkins-bot: [TOC] Remove pointer-events:none on .sidebar-toc-link [skins/Vector] (wmf/1.39.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/789327 (https://phabricator.wikimedia.org/T307271) (owner: 10Func) [13:20:28] !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-codfw cluster: Reboot kafka nodes [13:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [13:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [13:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [13:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27729 and previous config saved to /var/cache/conftool/dbconfig/20220505-132530-ladsgroup.json [13:25:33] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789556|GrothExperiments: Enable Add Link backend on tier 3 wikis (T304542)]] (again, used the wrong directory before) (duration: 00m 48s) [13:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:36] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:40] T304542: Deploy "add a link" to third round of wikis - https://phabricator.wikimedia.org/T304542 [13:25:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:25:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:02] Func: it's on mwdebug1001 [13:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27730 and previous config saved to /var/cache/conftool/dbconfig/20220505-132816-ladsgroup.json [13:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:55] !log mvernon@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2051.codfw.wmnet with OS bullseye [13:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:59] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2051.codfw.wmnet with OS bullseye executed with errors: - ms-be2051 (**FAIL**)... 
[13:29:32] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2051.codfw.wmnet with OS bullseye [13:29:32] tgr: Tested, good to go [13:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2051.codfw.wmnet with OS bullseye [13:30:29] PROBLEM - Check systemd state on gitlab-runner1001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-resource-monitor.service,docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:49] ^ thats me [13:30:49] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.10/skins/Vector/resources: Backport: [[gerrit:789327|[TOC] Remove pointer-events:none on .sidebar-toc-link (T307271)]] (duration: 00m 49s) [13:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:53] T307271: Links in the TOC of vector 2022 are not clickable for some Chromium based browsers - https://phabricator.wikimedia.org/T307271 [13:31:00] (PS1) Majavah: P:wmcs: unify toolsdb profiles [puppet] - https://gerrit.wikimedia.org/r/789611 [13:31:11] Func: thanks, it's live [13:31:20] thank you! 
[13:31:56] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35100/console" [puppet] - https://gerrit.wikimedia.org/r/789611 (owner: Majavah) [13:33:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27731 and previous config saved to /var/cache/conftool/dbconfig/20220505-133324-ladsgroup.json [13:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:51] !log UTC afternoon deploys done [13:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:45] (PS1) Peter Bowman: Add localized wordmark for plwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) [13:39:53] (CR) Btullis: [C: +1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [13:40:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27732 and previous config saved to /var/cache/conftool/dbconfig/20220505-134035-ladsgroup.json [13:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:59] (PS2) Muehlenhoff: Enable repo sync for node14 [puppet] - https://gerrit.wikimedia.org/r/789592 (https://phabricator.wikimedia.org/T306996) [13:41:15] (CR) Muehlenhoff: Enable repo sync for node14 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/789592 (https://phabricator.wikimedia.org/T306996) (owner: Muehlenhoff) [13:41:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P27733 and previous config saved to /var/cache/conftool/dbconfig/20220505-134321-ladsgroup.json [13:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:26] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:45:34] !log mvernon@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2051.codfw.wmnet with OS bullseye [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:40] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2051.codfw.wmnet with OS bullseye executed with errors: - ms-be2051 (**FAIL**)... [13:46:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2051.codfw.wmnet with OS bullseye [13:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:04] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2051.codfw.wmnet with OS bullseye [13:46:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/789592 (https://phabricator.wikimedia.org/T306996) (owner: 10Muehlenhoff) [13:47:02] (03PS1) 10Jelto: gitlab_runner: add overlayfs [puppet] - 10https://gerrit.wikimedia.org/r/789616 (https://phabricator.wikimedia.org/T307668) [13:48:15] (03CR) 10Muehlenhoff: [C: 03+2] Enable repo sync for node14 [puppet] - 10https://gerrit.wikimedia.org/r/789592 (https://phabricator.wikimedia.org/T306996) (owner: 10Muehlenhoff) [13:48:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27734 and previous config saved to /var/cache/conftool/dbconfig/20220505-134829-ladsgroup.json [13:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:34] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:48:59] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35101/console" [puppet] - 10https://gerrit.wikimedia.org/r/789616 (https://phabricator.wikimedia.org/T307668) (owner: 10Jelto) [13:50:55] RECOVERY - Check systemd state on ms-be2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:59] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/786274 (https://phabricator.wikimedia.org/T218426) (owner: 10David Caro) [13:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:28] (03CR) 10Ottomata: "This this is a chart change rather than a helmfile values change, you'll need to bump the chat version in changeprop/Chart.yaml too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [13:55:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27735 and previous config saved to /var/cache/conftool/dbconfig/20220505-135540-ladsgroup.json [13:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:59] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:00:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2051.codfw.wmnet with reason: host reimage [14:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [14:00:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [14:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T307525)', diff saved to 
https://phabricator.wikimedia.org/P27736 and previous config saved to /var/cache/conftool/dbconfig/20220505-140024-ladsgroup.json [14:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:28] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:01:40] (03CR) 10Vgutierrez: P:varnish::common: Add support for passing wikimedia_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [14:03:30] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2051.codfw.wmnet with reason: host reimage [14:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:15] (03CR) 10JMeybohm: [C: 03+1] "This will fix your issue, but I'd suggest to double check why docker actually worked until now as the overlayfs module should have been bl" [puppet] - 10https://gerrit.wikimedia.org/r/789616 (https://phabricator.wikimedia.org/T307668) (owner: 10Jelto) [14:08:28] (03PS1) 10Btullis: Increase the total number of HDFS files before the alert triggers [alerts] - 10https://gerrit.wikimedia.org/r/789618 (https://phabricator.wikimedia.org/T307549) [14:10:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27737 and previous config saved to /var/cache/conftool/dbconfig/20220505-141045-ladsgroup.json [14:10:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:10:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:51] T307525: Fix mismatching field type of user table for columns 
user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27738 and previous config saved to /var/cache/conftool/dbconfig/20220505-141053-ladsgroup.json [14:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:16] (03CR) 10jerkins-bot: [V: 04-1] Increase the total number of HDFS files before the alert triggers [alerts] - 10https://gerrit.wikimedia.org/r/789618 (https://phabricator.wikimedia.org/T307549) (owner: 10Btullis) [14:12:08] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:16:43] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2051.codfw.wmnet with OS bullseye [14:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:47] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2051.codfw.wmnet with OS bullseye completed: - ms-be2051 (**PASS**) - Removed... 
[14:18:32] (PS1) Muehlenhoff: Add missing update config for node14 sync Change-Id: Idbdde572930a03869a0a54b2d69830f113742a06 [puppet] - https://gerrit.wikimedia.org/r/789619 [14:19:12] (CR) jerkins-bot: [V: -1] Add missing update config for node14 sync Change-Id: Idbdde572930a03869a0a54b2d69830f113742a06 [puppet] - https://gerrit.wikimedia.org/r/789619 (owner: Muehlenhoff) [14:20:48] (PS2) Muehlenhoff: Add missing update config for node14 sync [puppet] - https://gerrit.wikimedia.org/r/789619 [14:21:25] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [14:21:27] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ml-staging2001.codfw.wmnet [14:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27739 and previous config saved to /var/cache/conftool/dbconfig/20220505-142136-ladsgroup.json [14:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:41] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:22:25] (CR) Bking: [C: +2] elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: Bking) [14:23:31] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:57] (CR) Muehlenhoff: [C: +2] Add missing update config for node14 sync [puppet] - https://gerrit.wikimedia.org/r/789619 
(owner: Muehlenhoff) [14:27:19] PROBLEM - Host ml-staging2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:43] RECOVERY - Host ml-staging2002 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:30:02] SRE, serviceops, Patch-For-Review: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (MoritzMuehlenhoff) I've added repo sync definitions for node 14 on Bullseye (and imported 14.19.2 into it). I suppose 16 is needed as well? [14:32:29] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:33] PROBLEM - Host ml-serve-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:11] RECOVERY - Host ml-serve-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [14:35:58] (PS3) Samtar: changeprop: Remove RESTBase page blacklist [deployment-charts] - https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) [14:36:03] PROBLEM - Host ml-serve-ctrl2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:29] ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (Papaul) [14:36:41] ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (Papaul) p:Triage→Medium [14:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27740 and previous config saved to /var/cache/conftool/dbconfig/20220505-143641-ladsgroup.json [14:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:51] RECOVERY - Host ml-serve-ctrl2002 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms [14:37:23] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:39:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:27] (03CR) 10David Caro: [C: 03+2] openstack: remove ussuri files [puppet] - 10https://gerrit.wikimedia.org/r/786274 (https://phabricator.wikimedia.org/T218426) (owner: 10David Caro) [14:43:38] 10ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (10wiki_willy) Submitted in Coupa via: https://wikimedia.coupahost.com/easy_form_responses/3043 Also, attached are the quotes from Sipi: {F35104913} {F35104912} [14:43:58] (03PS2) 10David Caro: openstack: remove ussuri files [puppet] - 10https://gerrit.wikimedia.org/r/786274 (https://phabricator.wikimedia.org/T218426) [14:45:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:42] (03CR) 10Vgutierrez: [C: 03+1] Expand stick-table test to three other hosts [puppet] - 10https://gerrit.wikimedia.org/r/789219 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [14:49:05] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:41] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:18] (ProbeDown) resolved: Service 
thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:38] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add overlayfs [puppet] - 10https://gerrit.wikimedia.org/r/789616 (https://phabricator.wikimedia.org/T307668) (owner: 10Jelto) [14:51:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27741 and previous config saved to /var/cache/conftool/dbconfig/20220505-145146-ladsgroup.json [14:51:49] RECOVERY - Host ml-serve2001 is UP: PING WARNING - Packet loss = 77%, RTA = 31.70 ms [14:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:29] (03CR) 10Ahmon Dancy: [C: 03+1] scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [14:52:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:53:37] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:55:02] (03CR) 10Ahmon Dancy: [C: 03+1] scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [14:55:21] RECOVERY - Check systemd state on logstash2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:37] RECOVERY - Check systemd state on gitlab-runner1001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:17] PROBLEM - Host ml-serve2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27742 and previous config saved to /var/cache/conftool/dbconfig/20220505-150038-ladsgroup.json [15:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:43] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:01:17] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:43] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1001.eqiad.wmnet [15:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:03:05] RECOVERY - Host ml-serve2002 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [15:03:45] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 110.4 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:03:45] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35102/console" [puppet] - 10https://gerrit.wikimedia.org/r/786274 (https://phabricator.wikimedia.org/T218426) (owner: 10David Caro) 
[15:03:48] (03CR) 10Hnowlan: [C: 03+1] "Seems sensible. I'll merge and deploy this early next week if that's okay - there's a moderate amount of risk here and I won't be able to " [deployment-charts] - 10https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [15:03:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:52] (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack: remove ussuri files [puppet] - 10https://gerrit.wikimedia.org/r/786274 (https://phabricator.wikimedia.org/T218426) (owner: 10David Caro) [15:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27743 and previous config saved to /var/cache/conftool/dbconfig/20220505-150651-ladsgroup.json [15:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:57] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:09:03] PROBLEM - Host ml-serve2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:14] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1002.eqiad.wmnet [15:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:02] (03CR) 10Samtar: changeprop: Remove RESTBase page blacklist (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [15:11:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:39] RECOVERY - Host ml-serve2003 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [15:11:43] 
PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:07] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:25] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:14:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:15:15] (CR) Jgiannelos: [C: -1] "We don't use master because we maintain our own fork; please update the branch to wmf/v0.14.x" [software/tegola] - https://gerrit.wikimedia.org/r/789222 (https://phabricator.wikimedia.org/T307507) (owner: Dduvall) [15:15:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27744 and previous config saved to /var/cache/conftool/dbconfig/20220505-151543-ladsgroup.json [15:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:31] PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:57] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (WDoranWMF) @Dzahn thanks so much for the help [15:17:25] !log jelto@cumin1001
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1002.eqiad.wmnet [15:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:37] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet [15:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:55] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 28 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [15:18:57] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:13] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:39] RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 33.60 ms [15:19:51] (03PS1) 10Jbond: C:netbox: update docs and fix minor lint issue [puppet] - 10https://gerrit.wikimedia.org/r/789629 [15:20:37] (03PS1) 10Stang: urwiki: allow "sysop" to add/remove "eliminator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789630 (https://phabricator.wikimedia.org/T307029) [15:20:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:04] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10WDoranWMF) @Jdforrester-WMF I 
recall this issue from, at least, more than a... [15:21:15] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:37] (03PS1) 10Aqu: role::common::aqs: Update mediawiki history source of aqs [puppet] - 10https://gerrit.wikimedia.org/r/789631 [15:22:59] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10Multichill) >>! In T279637#7906343, @MatthewVernon wrote: > Broadly, there should be little impact (and our monitoring suggests error rates within expected ranges); I hope any err... [15:23:34] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: 10Peter Bowman) [15:23:48] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet [15:23:51] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ml-serve2005.codfw.wmnet [15:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:59] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet [15:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:33] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet [15:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:46] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1004.eqiad.wmnet [15:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:39] hashar: Train looks good; solid enough that you're not going to roll back all the way to wmf.9 even on group0? 
If so, I can deploy T301483 finally. [15:26:40] T301483: Change the copyright warning for Mediawikiwiki's Help: namespace to CC0 - https://phabricator.wikimedia.org/T301483 [15:27:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 65 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [15:27:27] RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 32.84 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:27:47] James_F: hopefully, but we might find a regression later today which will prompt a rollback [15:27:52] (CR) Dduvall: [C: +1] ci: Provide basic `.pipeline/config.yaml` (1 comment) [software/tegola] - https://gerrit.wikimedia.org/r/789222 (https://phabricator.wikimedia.org/T307507) (owner: Dduvall) [15:28:00] hashar: All the way? :-( [15:28:08] I guess I can wait 'til next week. [15:28:12] Or risk it. [15:28:15] Maybe. Who knows what we broke [15:28:28] Move slowly and break stuff anyway® [15:28:53] ^ wikimedia release engineering should probably adopt that as a motto [15:29:02] * James_F grins. [15:29:06] if your change does not cause wmf.9 to cause an outage, it is probably fine [15:29:08] Y'all are amazing. [15:29:18] hashar: It'd only cause fatals on MW.org, so that's fine right? ;-) [15:29:49] definitely not. I would rather wait in this case :) [15:29:56] Ack. Will wait.
[15:30:05] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2003 is CRITICAL: 32 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [15:30:11] PROBLEM - Host kafka-main2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:30] then if nothing is suspicious later today, sure be bold. [15:30:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27745 and previous config saved to /var/cache/conftool/dbconfig/20220505-153048-ladsgroup.json [15:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:24] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet [15:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:49] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1004.eqiad.wmnet [15:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:13] (03PS1) 10Jbond: C:netbox: Add scap_repo parameter [puppet] - 10https://gerrit.wikimedia.org/r/789634 (https://phabricator.wikimedia.org/T296452) [15:32:15] (03PS1) 10Jbond: O:netbox::standalone: use netbox-next/deploy scap repo [puppet] - 10https://gerrit.wikimedia.org/r/789635 [15:32:21] (03PS1) 10Dduvall: ci: Provide basic `.pipeline/config.yaml` [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/789636 (https://phabricator.wikimedia.org/T307507) [15:32:39] RECOVERY - Host kafka-main2005 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [15:32:49] (03CR) 10Jbond: [C: 03+2] C:netbox: update docs and fix minor lint issue [puppet] - 10https://gerrit.wikimedia.org/r/789629 (owner: 10Jbond) [15:32:53] !log klausman@cumin1001 
START - Cookbook sre.hosts.reboot-single for host ml-serve2006.codfw.wmnet [15:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:02] (03Abandoned) 10Dduvall: ci: Provide basic `.pipeline/config.yaml` [software/tegola] - 10https://gerrit.wikimedia.org/r/789222 (https://phabricator.wikimedia.org/T307507) (owner: 10Dduvall) [15:33:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35103/console" [puppet] - 10https://gerrit.wikimedia.org/r/789635 (owner: 10Jbond) [15:33:05] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [15:33:21] (03PS2) 10Jbond: C:netbox: Add scap_repo parameter [puppet] - 10https://gerrit.wikimedia.org/r/789634 (https://phabricator.wikimedia.org/T296452) [15:33:30] (03CR) 10jerkins-bot: [V: 04-1] O:netbox::standalone: use netbox-next/deploy scap repo [puppet] - 10https://gerrit.wikimedia.org/r/789635 (owner: 10Jbond) [15:33:39] (03PS2) 10Jbond: O:netbox::standalone: use netbox-next/deploy scap repo [puppet] - 10https://gerrit.wikimedia.org/r/789635 [15:33:44] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2001.codfw.wmnet [15:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:59] (03PS3) 10Jbond: O:netbox::standalone: use netbox-next/deploy scap repo [puppet] - 10https://gerrit.wikimedia.org/r/789635 (https://phabricator.wikimedia.org/T296452) [15:34:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [15:34:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35104/console" [puppet] - 10https://gerrit.wikimedia.org/r/789634 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:34:33] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [15:35:54] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [15:35:55] (03PS3) 10Jbond: C:netbox: Add scap_repo parameter [puppet] - 10https://gerrit.wikimedia.org/r/789634 (https://phabricator.wikimedia.org/T296452) [15:37:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35105/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789634 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:38:09] (03PS4) 10Jbond: O:netbox::standalone: use netbox-next/deploy scap repo [puppet] - 10https://gerrit.wikimedia.org/r/789635 (https://phabricator.wikimedia.org/T296452) [15:39:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:40:16] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2006.codfw.wmnet [15:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [15:42:16] (03PS1) 10Hnowlan: postgres: allow enabling the slow query log on replicas [puppet] - 10https://gerrit.wikimedia.org/r/789638 (https://phabricator.wikimedia.org/T307671) [15:43:24] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2007.codfw.wmnet [15:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:12] (03PS1) 10Ayounsi: Add netbox-dev directory on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/789639 (https://phabricator.wikimedia.org/T296452) [15:44:18] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host gitlab-runner2001.codfw.wmnet [15:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:11] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35107/console" [puppet] - 10https://gerrit.wikimedia.org/r/789638 (https://phabricator.wikimedia.org/T307671) (owner: 10Hnowlan) [15:45:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27746 and previous config saved to 
/var/cache/conftool/dbconfig/20220505-154553-ladsgroup.json [15:45:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [15:45:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [15:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:58] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:45:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27747 and previous config saved to /var/cache/conftool/dbconfig/20220505-154607-ladsgroup.json [15:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) Unfortunately, this is an HP server and the server will not power on and will need a multitude of tests just to figure out what the problem could be and several hours of cal... 
[15:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:47:43] (03PS2) 10Hnowlan: postgres: allow enabling the slow query log on replicas [puppet] - 10https://gerrit.wikimedia.org/r/789638 (https://phabricator.wikimedia.org/T307671) [15:48:52] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2002.codfw.wmnet [15:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:47] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2007.codfw.wmnet [15:50:47] PROBLEM - Host ms-be1059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:05] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35109/console" [puppet] - 10https://gerrit.wikimedia.org/r/789638 (https://phabricator.wikimedia.org/T307671) (owner: 10Hnowlan) [15:51:44] (03CR) 10Hnowlan: postgres: allow enabling the slow query log on replicas [puppet] - 10https://gerrit.wikimedia.org/r/789638 (https://phabricator.wikimedia.org/T307671) (owner: 10Hnowlan) [15:52:29] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-codfw cluster: Reboot kafka nodes [15:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:59] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2008.codfw.wmnet [15:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 
(10MatthewVernon) Oh, bother. The problem is that our swift clusters can tolerate one failed system; so I can't straightforwardly do any more reimages in the eqiad ms- cluster while this... [15:54:42] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/pcc-worker1002/35110/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/789639 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:54:58] (03CR) 10Jgiannelos: [C: 03+1] postgres: allow enabling the slow query log on replicas [puppet] - 10https://gerrit.wikimedia.org/r/789638 (https://phabricator.wikimedia.org/T307671) (owner: 10Hnowlan) [15:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:55:56] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2002.codfw.wmnet [15:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:35] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2003.codfw.wmnet [15:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:29] (03CR) 10Joal: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/789631 (owner: 10Aqu) [15:59:59] RECOVERY - Host ms-be1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [16:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:22] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2008.codfw.wmnet [16:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:56] (03CR) 10Razzi: [C: 03+2] role::common::aqs: Update mediawiki history source of aqs [puppet] - 10https://gerrit.wikimedia.org/r/789631 (owner: 10Aqu) [16:03:04] 10SRE, 10ops-eqiad: restbase1031 NIC with limited connection speed after reboot - https://phabricator.wikimedia.org/T307677 (10Cmjohnson) @MoritzMuehlenhoff I replaced the cable, that should correct the problem [16:03:40] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2003.codfw.wmnet [16:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:16] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [16:04:19] (03PS1) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [16:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:15] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [16:05:31] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2004.codfw.wmnet [16:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:12] (03CR) 10David Caro: P:wmcs: unify toolsdb profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah) [16:07:47] !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[16:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:35] Hi, I have a question about backporting [16:08:50] I followed the instruction on https://www.mediawiki.org/wiki/Gerrit/Advanced_usage , but I got an error [16:09:00] (03PS1) 10Ssingh: durum: add monitoring::service for the check service [puppet] - 10https://gerrit.wikimedia.org/r/789646 [16:09:15] said " ! [remote rejected]", failed to push some refs ... [16:09:48] (03PS2) 10Ssingh: durum: add monitoring::service for the check service [puppet] - 10https://gerrit.wikimedia.org/r/789646 [16:10:36] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35112/console" [puppet] - 10https://gerrit.wikimedia.org/r/789646 (owner: 10Ssingh) [16:11:39] (03PS1) 10Dzahn: puppetmaster::geoip: remove legacy product IDs even for fallback option [puppet] - 10https://gerrit.wikimedia.org/r/789648 [16:12:04] koi: What/where are you pushing it to? [16:12:25] task is https://phabricator.wikimedia.org/T307675 [16:12:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2004.codfw.wmnet [16:12:36] (03PS2) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [16:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:48] and I run command like `git push origin HEAD:refs/for/wmf/1.39wmf10/2022/suppress-named-group` [16:14:00] the branch would be wmf/1.39.0-wmf.10 [16:14:01] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:06] 10SRE, 10ops-eqiad, 10DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (10Cmjohnson) Dell ticket number You have successfully submitted request SR1092770830. 
[16:14:16] But for simpler stuff, you might as well just use "cherry pick" in the gerrit U [16:14:17] UI [16:14:19] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: Bking) [16:14:21] RECOVERY - SSH on ms-be1059.mgmt is OK: SSH OK - mpSSH_0.2.1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:50] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=maps1007.eqiad.wmnet [16:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:07] (PS1) Stang: Suppress "named" group when TempUser system is disabled [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789332 (https://phabricator.wikimedia.org/T307675) [16:15:15] (PS2) Dzahn: puppetmaster::geoip: remove legacy product IDs even for fallback option [puppet] - https://gerrit.wikimedia.org/r/789648 [16:15:29] !log T307671 depool maps1007 from traffic per suggestion.
[16:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:33] T307671: High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:20 - https://phabricator.wikimedia.org/T307671 [16:15:35] (03PS1) 10Stang: Add messages for the "named" user group [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789333 (https://phabricator.wikimedia.org/T307675) [16:15:49] (03PS3) 10Dzahn: puppetmaster::geoip: remove legacy product IDs even for fallback option [puppet] - 10https://gerrit.wikimedia.org/r/789648 [16:16:26] Reedy, I did via UI, and do I need to change the topic to the same as that one on master [16:16:39] No [16:16:44] (03PS4) 10Dzahn: puppetmaster::geoip: remove legacy product IDs even for fallback option [puppet] - 10https://gerrit.wikimedia.org/r/789648 (https://phabricator.wikimedia.org/T302864) [16:17:04] got it, thanks [16:18:04] 10SRE, 10Analytics-Radar, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) I think this can be closed since it's in the past and superseded by T302864. 
[16:18:20] 10SRE, 10Analytics-Radar, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) [16:20:55] 10SRE, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: Maxmind: GeoIP Download Failed - https://phabricator.wikimedia.org/T302864 (10Dzahn) [16:21:07] (03PS5) 10Dzahn: puppetmaster::geoip: remove legacy product IDs even for fallback option [puppet] - 10https://gerrit.wikimedia.org/r/789648 (https://phabricator.wikimedia.org/T302864) [16:21:32] (03CR) 10Dzahn: [C: 03+2] "beta only" [puppet] - 10https://gerrit.wikimedia.org/r/789648 (https://phabricator.wikimedia.org/T302864) (owner: 10Dzahn) [16:25:28] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) @MatthewVernon okay, I will submit a ticket with HPE and see how it goes. On the plus side, it did eventually power up but there are some problems. Will not boot to the d... [16:26:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27748 and previous config saved to /var/cache/conftool/dbconfig/20220505-162617-ladsgroup.json [16:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:25] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) How hard is option 1? I'm starting to think up use cases for NEL data like comparing the ratio of reports/time vs webrequests/time for a given... 
[16:26:25] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [16:26:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Cmjohnson) @robh where are you with testing these? Just wondering if we can try and get these imaged and off the workboard. [16:27:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH assigning to you until ready to pass back [16:27:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10MatthewVernon) That's ... not inspiring optimism, is it? :( Thanks for the update, and for looking at this quickly! [16:29:44] (03PS3) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [16:30:18] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [16:31:00] (03PS4) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [16:32:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [16:33:19] (03CR) 10Dzahn: "Thank you. 
I remember it was missing in a list of "alternate_domains" which caused some unexpected results in https://phabricator.wikimedi" [puppet] - https://gerrit.wikimedia.org/r/789561 (https://phabricator.wikimedia.org/T266509) (owner: Vgutierrez) [16:33:41] (CR) Ayounsi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/789634 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [16:35:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:44] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudvirt1018.eqiad.wmnet - https://phabricator.wikimedia.org/T296790 (Cmjohnson) [16:38:28] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudvirt1018.eqiad.wmnet - https://phabricator.wikimedia.org/T296790 (Cmjohnson) Open→Resolved happy to remove cloudvirt1018!
many bad memories [16:38:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27749 and previous config saved to /var/cache/conftool/dbconfig/20220505-164122-ladsgroup.json [16:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:59] !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-codfw cluster: Reboot kafka nodes [16:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:22] (03CR) 10CDanis: [C: 03+2] Expand stick-table test to three other hosts [puppet] - 10https://gerrit.wikimedia.org/r/789219 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [16:47:42] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@ebbdbb6]: (no justification provided) [16:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:52] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@ebbdbb6]: (no justification provided) (duration: 00m 09s) [16:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:11] !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-eqiad cluster: Reboot kafka nodes [16:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:26] (03PS2) 10Btullis: Increase the total number of HDFS files before the alert triggers [alerts] - 10https://gerrit.wikimedia.org/r/789618 (https://phabricator.wikimedia.org/T307549) [16:56:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27750 and previous config saved to /var/cache/conftool/dbconfig/20220505-165627-ladsgroup.json [16:56:30] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:01:05] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake certificates and keys for etcd-v3.eqiad and etcd-v3.codfw [labs/private] - 10https://gerrit.wikimedia.org/r/788439 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [17:03:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:05:13] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (backup1002, ...), Fresh: 106 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:09:54] (03PS1) 10Majavah: kubernetes: fix DEFAULT_JDK_RESOURCES [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/789655 (https://phabricator.wikimedia.org/T307693) [17:09:58] (03CR) 10Btullis: [C: 03+2] Increase the total number of HDFS files before the alert triggers [alerts] - 10https://gerrit.wikimedia.org/r/789618 (https://phabricator.wikimedia.org/T307549) (owner: 10Btullis) [17:10:12] (03CR) 10Majavah: [C: 03+2] kubernetes: fix DEFAULT_JDK_RESOURCES [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/789655 (https://phabricator.wikimedia.org/T307693) (owner: 10Majavah) [17:11:07] (03Merged) 10jenkins-bot: kubernetes: fix DEFAULT_JDK_RESOURCES [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/789655 (https://phabricator.wikimedia.org/T307693) (owner: 10Majavah) [17:11:18] (03PS1) 10Majavah: d/changelog: Prepare for 0.83 release 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/789656 [17:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27751 and previous config saved to /var/cache/conftool/dbconfig/20220505-171132-ladsgroup.json [17:11:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:11:35] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for 0.83 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/789656 (owner: 10Majavah) [17:11:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:37] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [17:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27752 and previous config saved to /var/cache/conftool/dbconfig/20220505-171140-ladsgroup.json [17:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:02] (03Merged) 10jenkins-bot: Increase the total number of HDFS files before the alert triggers [alerts] - 10https://gerrit.wikimedia.org/r/789618 (https://phabricator.wikimedia.org/T307549) (owner: 10Btullis) [17:12:55] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.83 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/789656 (owner: 10Majavah) [17:17:49] (03PS1) 10BBlack: Explicitly define wikiworkshop ServerName as HTTPS [puppet] - 
10https://gerrit.wikimedia.org/r/789658 (https://phabricator.wikimedia.org/T251732) [17:20:32] !log phabricator - believe it or not - disabling the last active SUBVERSION repository in Diffusion (https://phabricator.wikimedia.org/diffusion/TSVN) [17:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:33] (03CR) 10Dzahn: "I deactivated https://phabricator.wikimedia.org/diffusion/TSVN just now (being bold, this is for toolserver, the thing before wm cloud ev" [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [17:27:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27753 and previous config saved to /var/cache/conftool/dbconfig/20220505-172758-ladsgroup.json [17:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:04] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [17:28:47] (03CR) 10Dzahn: [C: 03+2] No longer install subversion on Phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [17:28:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [17:31:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [17:33:49] 
(RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [17:36:08] 10SRE, 10Privacy Engineering, 10Research, 10Security-Team, and 3 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10BBlack) >>! In T251732#7892986, @bmansurov wrote: > 2. Resolve the https -> http redirect issue (who sho... [17:36:35] !log phab1001 - apt-get remove subversion [17:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [17:38:48] (03CR) 10Dzahn: [C: 03+2] "removed the subversion package from phab* servers manually" [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [17:40:38] (03CR) 10Dzahn: [C: 03+2] "also see https://debmonitor.wikimedia.org/packages/subversion now" [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [17:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27754 and previous config saved to /var/cache/conftool/dbconfig/20220505-174304-ladsgroup.json [17:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:29] 10SRE, 10serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 
(10bd808) >>! In T306996#7906981, @MoritzMuehlenhoff wrote: > I've added repo sync definitions for node 14 on Bullseye (and imported 14.19.2 into it). Thank you! > I suppose 16 is... [17:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:58:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27755 and previous config saved to /var/cache/conftool/dbconfig/20220505-175809-ladsgroup.json [17:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] hashar and brennen: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T1800). [18:01:26] train is complete. All wikis are at wmf.10 [18:02:17] hashar: kudos [18:02:30] hashar: seems like a good time to mess with contint* servers now [18:03:30] mutante: for what? [18:03:41] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/35113/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/768774 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:04:03] hashar: you wanted newer docker (docker-ce instead of docker.io) ^ [18:04:25] mutante: o/ [18:04:30] dduvall: [18:04:33] https://phabricator.wikimedia.org/T300682 [18:04:45] (where's the bot) [18:05:02] sleep()'ing on the job [18:05:04] yes for Dan. Please do it together. 
I think it was for BuildKit and it is above my knowledge [18:05:50] !log contint1001 - disabled puppet [18:05:52] buildkit bugfixes and just to be congruent with other ci docker hosts [18:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:03] what are the other hosts? [18:06:19] the integration-docker vpses [18:06:38] they shouldn't be affected by this deployment, however [18:06:43] ok, compiler doesnt know about those [18:07:02] they don't use profile::ci::docker or do they? [18:07:35] looks at https://openstack-browser.toolforge.org/puppetclass/ [18:07:49] Puppet Class: profile::ci::slave::labs::common [18:07:53] that's all I see [18:08:21] can't find the string "docker" in there though [18:08:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) Sorry about this! I'll take back over dumpsdata1006, but we'll need to modify partman recipes for this to work so this particular task will stay open... [18:08:41] do you know how docker is installed there? [18:09:18] they already have newer versions of docker due to running bullseye [18:09:34] mutante: role::ci::slave::labs::docker I think? https://openstack-browser.toolforge.org/server/integration-agent-docker-1025.integration.eqiad1.wikimedia.cloud [18:09:45] and this change only affects hosts running `debian::codename::lt('bullseye')` [18:10:15] ah, yes, thanks bd808 https://openstack-browser.toolforge.org/puppetclass/role::ci::slave::labs::docker [18:10:59] so.. 
this DOES use profile::ci::docker [18:10:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [18:11:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [18:11:07] gotta compile it on one of them [18:11:28] yes, that's a good idea [18:11:30] would be nice if the compiler could find those [18:11:31] should result in a noop [18:11:34] when giving the class name [18:11:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [18:11:53] Hosts that have failed to compile completely [18:11:54] integration-agent-docker-1023.integration.eqiad1.wikimedia.cloud [18:11:56] fails [18:12:01] grr [18:12:05] * dduvall looks [18:12:13] 404 Not Found.. sigh [18:12:47] dduvall: can you get on those instances? [18:12:54] yes [18:13:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T307525)', diff saved to https://phabricator.wikimedia.org/P27756 and previous config saved to /var/cache/conftool/dbconfig/20220505-181314-ladsgroup.json [18:13:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:13:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:13:18] well, let's do this. I disable puppet on contint* in prod. Merge it. 
You check if it's noop on integration [18:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:19] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [18:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:01] ok. shall i disable puppet and run it manually or is there a better way? [18:14:33] * dduvall does [18:14:49] dduvall: if you have an easy way to disable it on ALL of them, you can optionally do that.. but .. if not.. let's just do it anyways [18:15:03] let's roll [18:15:13] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "contint: Install docker 20.10 from thirdparty/ci on buster"" [puppet] - 10https://gerrit.wikimedia.org/r/768774 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:15:25] ok, disabled puppet on contint* for now [18:15:53] disabled puppet on integration-agent-docker-1023.integration.eqiad1.wikimedia.cloud as well [18:16:06] dduvall: alright, just run puppet on a random one. 
but confirm it already synced and really applies that change [18:16:20] (to the local p-master) [18:16:38] ah, right [18:16:44] let me check the puppet master first [18:16:47] it should tell you on the console though [18:16:53] which change it applies [18:18:44] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:48] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:19:37] mutante: looks good [18:19:57] https://www.irccloud.com/pastebin/0uOw2azf/ [18:20:23] change is there on the pm and it's a noop [18:20:35] dduvall: ok, great. then I will do contint2001 next [18:20:51] running puppet there [18:21:18] Warning: /Stage[main]/Helm/File[/var/cache/helm/repository]: Skipping because of failed dependencies [18:21:36] but maybe this is gone on next run.. 
we will see [18:22:23] Error: /Stage[main]/Helm/File[/var/cache/helm]/group: change from 'wikidev' to 'deployment' failed: Could not find group deployment [18:22:42] E: Version '5:20.10.9~3-0~debian-buster' for 'docker-ce' was not found [18:22:46] nope, doesn't work [18:22:51] :( [18:23:31] W: Target Sources (thirdparty/ci/source/Sources) is configured multiple times in /etc/apt/sources.list.d/repository_jenkins-thirdparty-ci.list [18:23:38] looking for remnants [18:25:21] looks like the version may have been bumped since this was attempted last [18:26:30] mutante: i'm looking at https://apt.wikimedia.org/wikimedia/dists/buster-wikimedia/thirdparty/ci/binary-amd64/Packages and it appears the version is newer than what is specified in the manifest [18:26:52] so perhaps the version was bumped since the patch was authored [18:27:06] !log contint2001 - deleting /etc/apt/sources.list.d/repository_jenkins-thirdparty-ci.list is identical to thirdparty-ci.list . deleting the former to avoid duplicate definition warnings [18:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:36] there are 2 list files with the same content and puppet recreates them both [18:27:49] could be ignored but still warnings because of that [18:27:56] dduvall: makes sense [18:28:12] shall i amend the patch? 
[18:28:44] yes please [18:30:24] [apt1001:~] $ sudo -i reprepro ls docker-ce [18:30:24] docker-ce | 5:19.03.5~3-0~debian-stretch | stretch-wikimedia | amd64 [18:30:27] docker-ce | 18.06.3~ce~3-0~debian | stretch-wikimedia | amd64 [18:30:30] docker-ce | 5:20.10.12~3-0~debian-buster | buster-wikimedia | amd64 [18:30:33] docker-ce | 5:20.10.8~3-0~debian-buster | buster-wikimedia | amd64 [18:30:36] ^ [18:30:50] jouncebot: nowandnext [18:30:50] For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T1800) [18:30:50] In 1 hour(s) and 29 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T2000) [18:31:15] (03PS1) 10Dduvall: contint: Bump docker 20.10 version for thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/789668 (https://phabricator.wikimedia.org/T300682) [18:31:21] uhmm.. there is a train now, dduvall [18:31:28] I gather the train is already done, would you mind if I deploy stuff dduvall mutante ? [18:31:45] train is done i believe, right hashar ? 
[18:31:48] Amir1: we wouldn't care about deploys to appservers but I wouldn't rely on CI working [18:32:08] The train is done [18:32:12] awesome [18:32:16] dancy: thanks [18:32:18] (03PS2) 10Ladsgroup: Stop writing to temp actor table in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789558 (https://phabricator.wikimedia.org/T275246) [18:32:21] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to temp actor table in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789558 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [18:32:55] (03CR) 10jerkins-bot: [V: 04-1] contint: Bump docker 20.10 version for thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/789668 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:33:08] (03Merged) 10jenkins-bot: Stop writing to temp actor table in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789558 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [18:33:31] (03PS2) 10Dzahn: contint: Bump docker 20.10 version for thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/789668 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:33:33] (03PS3) 10Dduvall: contint: Bump docker 20.10 version for thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/789668 (https://phabricator.wikimedia.org/T300682) [18:33:58] (03CR) 10Dzahn: [C: 03+2] "apt1001:~] $ sudo -i reprepro ls docker-ce" [puppet] - 10https://gerrit.wikimedia.org/r/789668 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:34:37] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789558|Stop writing to temp actor table in group0 (T275246)]] (duration: 00m 50s) [18:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:42] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [18:35:04] (03CR) 10Dzahn: [V: 03+2] contint: Bump
docker 20.10 version for thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/789668 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:36:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:53] (03PS6) 10Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) [18:37:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:37:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:44] (03CR) 10Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz) [18:38:58] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:02] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started 
by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed from Puppet and PuppetD... [18:39:10] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for HMonroy and DMaza - https://phabricator.wikimedia.org/T307737 (10HMonroy) a:03MMandere [18:39:28] the good part: [18:39:30] Notice: /Stage[main]/Profile::Ci::Docker/Package[docker-ce]/ensure: created [18:39:34] the bad part: [18:39:46] :| [18:39:47] all kinds of dependency issues that seem to have existed before we did anything [18:40:00] but at least the puppet run does finish [18:40:25] if you don't mind filing a bug with the log i can take a look and do a follow-up [18:40:35] dduvall: now we have docker-ce installed! [18:40:40] and docker.io is "rc" [18:40:46] thank you very much! [18:40:57] let me actually purge that fully [18:41:09] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for HMonroy and DMaza - https://phabricator.wikimedia.org/T307737 (10HMonroy) a:05MMandere→03None [18:42:01] !log contitn2001 - apt-get remove --purge docker.io after docker-ce was installed by puppet for T300682 [18:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:06] T300682: contint1001 and contint2001 need a newer version of Docker installed - https://phabricator.wikimedia.org/T300682 [18:43:37] docker version [18:43:37] Client: Docker Engine - Community Version: 20.10.12 [18:43:41] dduvall: ^ [18:43:56] very nice [18:44:18] bd808: ^ :) [18:44:20] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Ariel Gutman - https://phabricator.wikimedia.org/T307582 (10dr0ptp4kt) I don't see a need for production shell access at this time. I do believe it's likely the case that @AGutman-WMF will need analytics cluster access for act...
[18:45:42] now doing the same for contint1001 [18:46:11] (03PS2) 10Ladsgroup: Set cebwiki to read new in templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789562 (https://phabricator.wikimedia.org/T306673) [18:46:15] (03CR) 10Ladsgroup: [C: 03+2] Set cebwiki to read new in templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789562 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [18:47:01] (03Merged) 10jenkins-bot: Set cebwiki to read new in templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789562 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [18:47:15] mutante: `docker version` looks good [18:47:17] (03CR) 10JHathaway: admin: Add Ariel Gutman to LDAP only accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789288 (https://phabricator.wikimedia.org/T307582) (owner: 10JHathaway) [18:47:20] (03CR) 10JHathaway: [C: 03+2] admin: Add Ariel Gutman to LDAP only accounts [puppet] - 10https://gerrit.wikimedia.org/r/789288 (https://phabricator.wikimedia.org/T307582) (owner: 10JHathaway) [18:47:49] !log razzi@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-codfw cluster: Reboot kafka nodes [18:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:33] dduvall: the part I don't like is that contint2001 is not identical to contint1001 which suggests manual changes happened in the past but just on one server [18:48:44] what's different? [18:48:47] like there is no /var/lib/docker on 1001 [18:48:55] which was removed by purging the old package on 2001 [18:49:01] is there a `/srv/docker` on contint2001? 
[18:49:04] right [18:49:04] here it says that /etc/docker is not empty [18:49:08] so not removed [18:49:16] it didnt say that on the other host [18:49:32] i believe both should be configured to use `/srv/docker` as the docker data dir [18:49:35] there is /srv/docker on both [18:49:46] ok, that's the important part [18:50:21] alright, so.. done from my side [18:50:27] can we do some tests? [18:51:05] we could re-run a job that publishes an image [18:51:16] those should always land on contint servers [18:51:30] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (10dmaza) [18:51:31] !log contitn1001 - apt-get remove --purge docker.io after docker-ce was installed by puppet for T300682 (different behaviour from contint2001 since it did not have /var/lib/docker) [18:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:36] T300682: contint1001 and contint2001 need a newer version of Docker installed - https://phabricator.wikimedia.org/T300682 [18:51:53] (03PS5) 10Gehel: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [18:51:54] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789562|Set cebwiki to read new in templatelinks migration (T306673)]] (duration: 00m 49s) [18:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:58] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [18:53:00] mutante: so far so good https://integration.wikimedia.org/ci/job/blubber-pipeline-rehearse/100/console [18:53:13] landed on contint1001 and has successfully run multiple containers already [18:53:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:53:21] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:35] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-eqiad cluster: Reboot kafka nodes [18:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:49] mutante: thanks for the deployment and sorry for the cruft on contint* servers [18:54:07] dduvall: cool! (when doing a 'docker images' I see a lot of lines with "<none>". don't know if that is normal) [18:54:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:54:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:58] I'm done, ping me if things go sideways [18:55:06] mutante: that's normal [18:55:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:21] Amir1: you're good to go from my perspective [18:55:22] alright! i think we are done then:) updating the ticket [18:55:29] thanks again :) [18:55:54] yep, yw!
[18:57:42] (03PS1) 10Slyngshede: Replace crontab with systemd timers for Postgresql dump [puppet] - 10https://gerrit.wikimedia.org/r/789677 (https://phabricator.wikimedia.org/T273673) [18:58:09] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [18:58:18] (03CR) 10jerkins-bot: [V: 04-1] Replace crontab with systemd timers for Postgresql dump [puppet] - 10https://gerrit.wikimedia.org/r/789677 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:58:27] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/777433" [puppet] - 10https://gerrit.wikimedia.org/r/789677 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:59:40] dduvall: https://phabricator.wikimedia.org/T307740 [19:00:37] (03PS6) 10Gehel: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [19:00:39] (03PS1) 10Gehel: elasticsearch: cleanup indentation to follow usual style [puppet] - 10https://gerrit.wikimedia.org/r/789678 [19:01:01] mutante: thanks. i'll have a look this week [19:01:29] maybe it's related to https://phabricator.wikimedia.org/T303857 [19:01:34] thanks! 
ack [19:01:47] (03PS2) 10Slyngshede: Replace crontab with systemd timers for Postgresql dump [puppet] - 10https://gerrit.wikimedia.org/r/789677 (https://phabricator.wikimedia.org/T273673) [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:12] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: contint: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10Dzahn) [19:03:44] (03PS7) 10Gehel: elasticsearch: cleanup indentation to follow usual style [puppet] - 10https://gerrit.wikimedia.org/r/789644 (owner: 10Bking) [19:03:46] (03PS2) 10Gehel: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789678 (https://phabricator.wikimedia.org/T289135) [19:08:13] (03PS3) 10Gehel: elasticsearch: cleanup indentation to follow usual style [puppet] - 10https://gerrit.wikimedia.org/r/789678 [19:08:15] (03PS8) 10Gehel: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [19:11:19] (03CR) 10Ryan Kemper: [C: 03+1] elasticsearch: cleanup indentation to follow usual style [puppet] - 10https://gerrit.wikimedia.org/r/789678 (owner: 10Gehel) [19:11:53] (03CR) 10Gehel: [C: 03+2] elasticsearch: cleanup indentation to follow usual style [puppet] - 10https://gerrit.wikimedia.org/r/789678 (owner: 10Gehel) [19:12:38] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add monitoring::service for the check service [puppet] - 10https://gerrit.wikimedia.org/r/789646 (owner: 10Ssingh) [19:13:08] (03PS9) 10Gehel: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 
(https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [19:15:09] (03PS10) 10Gehel: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [19:17:40] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Ariel Gutman - https://phabricator.wikimedia.org/T307582 (10jhathaway) 05Open→03Resolved a:03jhathaway @AGutman-WMF you have been added to the wmf group, please reopen if there are any issues! [19:21:31] dduvall: yay! I'll find my reverted feature patch and try to resurrect it later today. :) [19:23:43] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10Ottomata) Not hard at all, there is plenty of puppet to support it. Just need to run it somewhere. We currently colocate MirrorMaker on target cluste... [19:42:14] (03PS1) 10JMeybohm: WIP: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 [19:43:27] (03PS6) 10EllenR: Set log level to 'debug' for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) [19:44:33] 10SRE, 10SRE-Access-Requests, 10Generated Data Platform: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (10jhathaway) [19:44:36] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [19:45:01] (03PS2) 10JMeybohm: WIP: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 [19:45:38] (03PS1) 10JHathaway: admin: add Fabian Kaelin to analytics-platform-eng-admins [puppet] - 
https://gerrit.wikimedia.org/r/789681 (https://phabricator.wikimedia.org/T307573) [19:46:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:47:40] (PS7) Mepps: Set log level to 'debug' for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: EllenR) [19:48:10] SRE, SRE-Access-Requests, Generated Data Platform, Patch-For-Review: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (jhathaway) @WDoranWMF patch cut, if you could explicitly approve as a comment that would be appreciated, though I ta... [19:48:44] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/789681 (https://phabricator.wikimedia.org/T307573) (owner: JHathaway) [19:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:56:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:57:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:58:15] (CR) Mepps: "Ignore my commit. I didn't see the rebase was complete." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: EllenR) [20:00:05] brennen: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T2000). [20:00:05] koi: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] I'm here [20:01:33] hey koi looking through your patches now [20:02:18] (PS2) Stang: urwiki: allow "sysop" to add/remove "eliminator" [mediawiki-config] - https://gerrit.wikimedia.org/r/789630 (https://phabricator.wikimedia.org/T307029) [20:03:08] (CR) Herron: [C: +1] prometheus: remove high NEL alert, moved to alerts.git [puppet] - https://gerrit.wikimedia.org/r/789152 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [20:04:14] SRE, SRE-Access-Requests, Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (jhathaway) @HMonroy happy to help grant superset access, but I am not sure exactly how to do that? This ticket, T283190, appears similar, is that what group you... 
[20:04:22] SRE, SRE-Access-Requests, Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (jhathaway) a:jhathaway [20:05:41] SRE, SRE-Access-Requests, Generated Data Platform, Patch-For-Review: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (jhathaway) a:hnowlan→jhathaway [20:10:50] !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-eqiad cluster: Reboot kafka nodes [20:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:21] hi thcipriani wondering any progress? [20:12:42] koi: sorry, I'm fiddling on the deployment server, trying a bit of a new process, sorry for the delay :( [20:20:26] !log thcipriani@deploy1002 backport aborted: (duration: 00m 02s) [20:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:29] !log thcipriani@deploy1002 backport aborted: (duration: 00m 41s) [20:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:11] (CR) Thcipriani: [C: +2] "Approved via scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/789630 (https://phabricator.wikimedia.org/T307029) (owner: Stang) [20:23:25] wee working [20:23:56] (Merged) jenkins-bot: urwiki: allow "sysop" to add/remove "eliminator" [mediawiki-config] - https://gerrit.wikimedia.org/r/789630 (https://phabricator.wikimedia.org/T307029) (owner: Stang) [20:25:49] ^ koi that one is live on mwdebug1002, check please [20:25:57] looking [20:26:07] (PS7) Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) [20:26:20] lgtm [20:26:30] thanks for verifying, syncing [20:27:26] (CR) jerkins-bot: [V: -1] Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - 
https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz) [20:28:29] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789630|urwiki: allow "sysop" to add/remove "eliminator" (T307029)]] (duration: 00m 49s) [20:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:34] T307029: Allow admins on ur.wiki to add/remove users to eliminator group - https://phabricator.wikimedia.org/T307029 [20:28:39] ^ should be live everywhere now [20:29:13] (CR) Thcipriani: [C: +2] Suppress "named" group when TempUser system is disabled [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789332 (https://phabricator.wikimedia.org/T307675) (owner: Stang) [20:29:17] (CR) Thcipriani: [C: +2] Add messages for the "named" user group [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789333 (https://phabricator.wikimedia.org/T307675) (owner: Stang) [20:29:19] yeah confirmed [20:30:21] (waiting on CI for the others) [20:31:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:49] koi: i'll be taking over here once these merge, doing a single sync-world to cover both. [20:41:27] Got it, so no need to check on mwdebug1002 for backport? 
[20:42:21] still waiting on CI, i'll let you know when the non-localization one is checkable [20:46:44] (Merged) jenkins-bot: Suppress "named" group when TempUser system is disabled [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789332 (https://phabricator.wikimedia.org/T307675) (owner: Stang) [20:46:50] (Merged) jenkins-bot: Add messages for the "named" user group [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789333 (https://phabricator.wikimedia.org/T307675) (owner: Stang) [20:48:35] koi: on mwdebug1002 for testing now [20:48:45] looking [20:48:46] whoops, one sec [20:48:49] missed a rebase [20:49:16] koi: ok, checkable now [20:50:09] LGTM, I see no "name" inside Member of groups field on Special:Preferences [20:50:40] koi: cool, syncing. [20:51:28] !log brennen@deploy1002 Started scap: Backport: Revert: [[gerrit:789333|Add messages for the "named" user group (T307675)]] and Backport: [[gerrit:789332|Suppress "named" group when TempUser system is disabled (T307675)]] [20:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:34] T307675: Mysterious "named" user group - https://phabricator.wikimedia.org/T307675 [20:53:17] (PS1) Andrew Bogott: Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) [20:53:36] !log sync of last patch ongoing, otherwise closing UTC late backport and config window [20:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:50] (CR) jerkins-bot: [V: -1] Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) (owner: Andrew Bogott) [20:55:10] (PS2) Andrew Bogott: Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) [20:56:53] (CR) jerkins-bot: 
[V: -1] Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) (owner: Andrew Bogott) [20:58:20] ::sigh:: - something here appears to be causing a fatal [21:00:18] 0_o [21:03:22] !log brennen@deploy1002 sync-world aborted: Backport: Revert: [[gerrit:789333|Add messages for the "named" user group (T307675)]] and Backport: [[gerrit:789332|Suppress "named" group when TempUser system is disabled (T307675)]] (duration: 11m 53s) [21:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:27] T307675: Mysterious "named" user group - https://phabricator.wikimedia.org/T307675 [21:04:14] lgtm [21:05:11] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.10/includes/user: Backport: Revert: [[gerrit:789332|Suppress "named" group when TempUser system is disabled (T307675)]] (duration: 00m 50s) [21:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:46] !log reboot mx2001 for kernel update [21:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:50] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/789690 [21:06:12] aha? brennen, what happened [21:06:27] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 325 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:06:45] (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/789690 (owner: PipelineBot) [21:06:53] a number of fatals from TempUserCreator & UserGroupManager [21:07:12] that's reverted; i'll file a bug against it and re-do the sync-world to ensure localization is where expected. 
[21:08:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:43] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:08:47] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: new kernel [21:08:48] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: new kernel [21:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:00] (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/789690 (owner: PipelineBot) [21:11:02] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [21:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:38] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [21:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:31] (PS1) Brennen Bearnes: Revert "Suppress "named" group when TempUser system is disabled" [core] (wmf/1.39.0-wmf.10) - 
https://gerrit.wikimedia.org/r/789692 [21:15:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:22] (CR) Brennen Bearnes: [C: +2] "This revert is deployed." [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789692 (owner: Brennen Bearnes) [21:17:30] !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [21:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:09] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [21:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:22] !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [21:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:52] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [21:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:58] (CR) Brennen Bearnes: [C: +2] "I deployed this backport earlier and was getting (output from logspam-watch):" [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789692 (owner: Brennen Bearnes) [21:20:19] this may have been a file order sync artifact. 
[21:21:31] !log reboot mx1001 [21:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:37] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: new kernel [21:21:38] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: new kernel [21:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:40] !log brennen@deploy1002 Started scap: Resuming previously interrupted sync-world [21:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:22] (PS3) Andrew Bogott: Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) [21:24:56] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [21:24:57] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [21:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:06] (CR) jerkins-bot: [V: -1] Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) (owner: Andrew Bogott) [21:26:27] !log brennen@deploy1002 Finished scap: Resuming previously interrupted sync-world (duration: 03m 47s) [21:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:58] (CR) Brennen Bearnes: [C: -2] "Yep, looks like sync order, will try this again." 
[core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789692 (owner: Brennen Bearnes) [21:28:04] (PS4) Andrew Bogott: Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) [21:28:22] (Abandoned) Brennen Bearnes: Revert "Suppress "named" group when TempUser system is disabled" [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789692 (owner: Brennen Bearnes) [21:29:47] (CR) jerkins-bot: [V: -1] Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) (owner: Andrew Bogott) [21:30:31] (CR) Andrew Bogott: [V: +2 C: +2] Horizon: include openstack bpos on cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/789687 (https://phabricator.wikimedia.org/T307561) (owner: Andrew Bogott) [21:33:44] !log brennen@deploy1002 scap failed: average error rate on 7/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [21:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:21] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.10/includes/user: Backport: [[gerrit:789332|Suppress "named" group when TempUser system is disabled (T307675)]] (duration: 00m 48s) [21:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:26] T307675: Mysterious "named" user group - https://phabricator.wikimedia.org/T307675 [21:36:44] welp, still reasoned incorrectly about file sync order, i think it should be good now with the whole includes/user directory synced. [21:36:49] that's my nerves shot for the week. [21:38:03] I see it looks great on my side, so is sync completed [21:38:41] yeah, should be everywhere at this point. [21:39:37] ok thanks! 
[21:40:36] (PS10) Hoo man: Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331) [21:48:10] (PS11) Hoo man: Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331) [21:49:12] brennen: You're done? [21:49:27] yep, all clear. [21:49:50] I'll quickly hijack the window then to deploy this tiny config change :) [21:50:33] have fun! [21:51:00] i'm off for the week; see all you fine people bright and early (for some values & locales of both) monday. [21:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:51:53] (CR) Hoo man: [C: +2] Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331) [21:52:26] SRE, SRE-Access-Requests, Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (HMonroy) @jhathaway this ticket is exactly the same request as [T296161]. Dmaza and I need the same access as samwilson. Please let me know if you still need me... 
[21:53:13] (Merged) jenkins-bot: Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331) [21:56:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:20] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:734722|Add missing termbox codes from Wikibase (T277836)]] (duration: 00m 48s) [21:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:25] T277836: Recent additions to term languages have not been added to InitialiseSettings.php - https://phabricator.wikimedia.org/T277836 [22:00:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:00:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:06] SRE, SRE-Access-Requests, Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (jhathaway) Thanks @HMonroy from my read of T296161 it is ultimately the same as T283190, both add the user to the analytics-privatedata-users group. If you coul... 
[22:06:36] !log razzi@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-eqiad cluster: Reboot kafka nodes [22:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:13] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:02] SRE, SRE-Access-Requests, Generated Data Platform, Patch-For-Review: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (WDoranWMF) Approved, thanks @jhathaway [22:13:05] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:14:27] SRE, SRE-Access-Requests, Generated Data Platform, Patch-For-Review: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (jhathaway) [22:26:03] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:38:35] (PS4) Razzi: Use both dbproxy101[89] servers for both wikireplica services [puppet] - https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: Btullis) [22:39:47] (CR) Razzi: Use both dbproxy101[89] servers for both wikireplica services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: Btullis) [23:00:14] (CR) 
Dduvall: [C: +1] "Seems good from my end." [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/789636 (https://phabricator.wikimedia.org/T307507) (owner: Dduvall) [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:13:04] (PS1) BCornwall: admin: Add UNIX user account for Brett Cornwall [puppet] - https://gerrit.wikimedia.org/r/789726 [23:13:39] (CR) jerkins-bot: [V: -1] admin: Add UNIX user account for Brett Cornwall [puppet] - https://gerrit.wikimedia.org/r/789726 (owner: BCornwall) [23:14:11] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:15:57] (PS2) BCornwall: admin: Add UNIX user account for Brett Cornwall [puppet] - https://gerrit.wikimedia.org/r/789726 [23:20:57] (CR) BCornwall: "The onboarding instructions mention making two separate commits: One for the user and the second for adding to the ops group. Ssingh seems" [puppet] - https://gerrit.wikimedia.org/r/789726 (owner: BCornwall) [23:25:33] (CR) Dzahn: [C: -1] "please use the UID you already have in LDAP. it's 39104. You can find it like this: [mwmaint1002:~] $ ldapsearch -x mail=bcorn*" [puppet] - https://gerrit.wikimedia.org/r/789726 (owner: BCornwall) [23:28:56] (CR) Cwhite: [C: +1] prometheus: remove high NEL alert, moved to alerts.git [puppet] - https://gerrit.wikimedia.org/r/789152 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [23:31:15] (CR) Cwhite: [C: +1] "LGTM! Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron) [23:35:38] (PS3) BCornwall: admin: Add UNIX user account for Brett Cornwall [puppet] - https://gerrit.wikimedia.org/r/789726 [23:35:40] (PS1) BCornwall: admin: Add user "brett" to ops group [puppet] - https://gerrit.wikimedia.org/r/789727 [23:36:33] (CR) jerkins-bot: [V: -1] admin: Add user "brett" to ops group [puppet] - https://gerrit.wikimedia.org/r/789727 (owner: BCornwall) [23:42:09] (CR) Dzahn: [C: +1] "looks good to me (can't check the key itself though)" [puppet] - https://gerrit.wikimedia.org/r/789726 (owner: BCornwall) [23:42:30] (CR) BCornwall: admin: Add UNIX user account for Brett Cornwall (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789726 (owner: BCornwall) [23:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:54:55] (CR) Cwhite: "Approach looks good." [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [23:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale