[00:08:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23926 and previous config saved to /var/cache/conftool/dbconfig/20220331-000816-ladsgroup.json [00:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23927 and previous config saved to /var/cache/conftool/dbconfig/20220331-000856-ladsgroup.json [00:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:45] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:17:00] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/httpbb/buster/httpbb_0.0.1-1_source.changes # T299705 [00:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:06] T299705: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 [00:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23928 and previous config saved to /var/cache/conftool/dbconfig/20220331-002321-ladsgroup.json [00:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23929 and previous config saved to /var/cache/conftool/dbconfig/20220331-002401-ladsgroup.json [00:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:33] PROBLEM - Varnish traffic drop between 30min ago 
and now at ulsfo on alert1001 is CRITICAL: 31.07 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:25:51] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 57.48 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:26:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 44.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:26:41] ^ OK [00:27:49] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:32:53] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 70.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:34:49] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 77.64 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:38:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23930 and previous config saved to /var/cache/conftool/dbconfig/20220331-003826-ladsgroup.json [00:38:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance 
[00:38:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [00:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:38:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23931 and previous config saved to /var/cache/conftool/dbconfig/20220331-003834-ladsgroup.json [00:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23932 and previous config saved to /var/cache/conftool/dbconfig/20220331-003906-ladsgroup.json [00:39:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [00:39:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [00:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:13] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 
(T298565)', diff saved to https://phabricator.wikimedia.org/P23933 and previous config saved to /var/cache/conftool/dbconfig/20220331-003914-ladsgroup.json [00:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23934 and previous config saved to /var/cache/conftool/dbconfig/20220331-004122-ladsgroup.json [00:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298557)', diff saved to https://phabricator.wikimedia.org/P23935 and previous config saved to /var/cache/conftool/dbconfig/20220331-004211-marostegui.json [00:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:19] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [00:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23936 and previous config saved to /var/cache/conftool/dbconfig/20220331-005627-ladsgroup.json [00:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P23937 and previous config saved to /var/cache/conftool/dbconfig/20220331-005716-marostegui.json [00:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23938 and previous config saved to 
/var/cache/conftool/dbconfig/20220331-011132-ladsgroup.json [01:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P23939 and previous config saved to /var/cache/conftool/dbconfig/20220331-011221-marostegui.json [01:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:21] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:21:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23940 and previous config saved to /var/cache/conftool/dbconfig/20220331-012120-ladsgroup.json [01:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:21:41] (PS1) Func: Add logo variants for zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/775416 (https://phabricator.wikimedia.org/T273578) [01:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23941 and previous config saved to /var/cache/conftool/dbconfig/20220331-012637-ladsgroup.json [01:26:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [01:26:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[01:26:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:26:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23942 and previous config saved to /var/cache/conftool/dbconfig/20220331-012650-ladsgroup.json [01:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298557)', diff saved to https://phabricator.wikimedia.org/P23943 and previous config saved to /var/cache/conftool/dbconfig/20220331-012726-marostegui.json [01:27:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [01:27:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [01:27:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:32] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [01:27:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298557)', diff saved to https://phabricator.wikimedia.org/P23944 and previous config saved to /var/cache/conftool/dbconfig/20220331-012734-marostegui.json [01:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23945 and previous config saved to /var/cache/conftool/dbconfig/20220331-012858-ladsgroup.json [01:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:11] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23946 and previous config saved to /var/cache/conftool/dbconfig/20220331-013625-ladsgroup.json [01:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:22] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [01:41:36] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [01:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300775)', diff saved to https://phabricator.wikimedia.org/P23947 and previous config saved to /var/cache/conftool/dbconfig/20220331-014140-marostegui.json [01:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:48] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [01:44:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23948 and previous config saved to /var/cache/conftool/dbconfig/20220331-014403-ladsgroup.json [01:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:22] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23949 and previous config saved to /var/cache/conftool/dbconfig/20220331-015130-ladsgroup.json [01:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:49] (CR) Steven Sun: Revert Simplified Chinese logo of zhwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: Steven Sun) [01:59:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to
https://phabricator.wikimedia.org/P23950 and previous config saved to /var/cache/conftool/dbconfig/20220331-015908-ladsgroup.json [01:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:48] > DannyS712: Yes that's fine, eating lunch now but I'll take care of it when I'm back [02:01:48] RoanKattouw any update? Or should I schedule it for a later window [02:02:42] DannyS712: oh I'm so sorry! I got distracted with other things and forgot [02:03:31] I'm afk right now but I'll merge it in about 30 mins [02:03:39] And this time I'll set a reminder so that I don't forget again [02:03:46] okay, thanks - its just a PHPCS cleanup so shouldn't need any testing [02:06:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23951 and previous config saved to /var/cache/conftool/dbconfig/20220331-020635-ladsgroup.json [02:06:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [02:06:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [02:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:06:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23952 and previous config saved to /var/cache/conftool/dbconfig/20220331-020643-ladsgroup.json [02:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:12:54] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/775005 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [02:14:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23953 and previous config saved to /var/cache/conftool/dbconfig/20220331-021413-ladsgroup.json [02:14:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [02:14:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [02:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [02:14:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:14:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [02:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00
on db2104.codfw.wmnet with reason: Maintenance [02:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [02:14:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [02:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [02:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [02:14:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [02:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23954 and previous config saved to /var/cache/conftool/dbconfig/20220331-021450-ladsgroup.json [02:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:29] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:17:47] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/775426 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [02:17:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23955 and previous config saved to /var/cache/conftool/dbconfig/20220331-021758-ladsgroup.json [02:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23956 and previous config saved to /var/cache/conftool/dbconfig/20220331-022008-ladsgroup.json [02:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:29:31] (CR) DannyS712: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/775427 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [02:33:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23957 and previous config saved to /var/cache/conftool/dbconfig/20220331-023303-ladsgroup.json [02:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:11] (CR) Catrope: [C: +2] phpcs: clean up MWConfigCacheGenerator and enable rules [mediawiki-config] - https://gerrit.wikimedia.org/r/773966 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [02:35:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23958 and previous config saved to /var/cache/conftool/dbconfig/20220331-023513-ladsgroup.json [02:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:14] (Merged) jenkins-bot: phpcs: clean up MWConfigCacheGenerator and enable rules [mediawiki-config] - https://gerrit.wikimedia.org/r/773966 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [02:43:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:44:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:08] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23959 and previous config saved to /var/cache/conftool/dbconfig/20220331-024808-ladsgroup.json [02:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23960 and previous config saved to /var/cache/conftool/dbconfig/20220331-025018-ladsgroup.json [02:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:35] !log catrope@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: [[gerrit:773966|Code style-only change to MWConfigCacheGenerator.php]] (duration: 00m 52s) [02:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298557)', diff saved to https://phabricator.wikimedia.org/P23961 and previous config saved to /var/cache/conftool/dbconfig/20220331-030146-marostegui.json [03:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:52] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [03:03:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23962 and previous config saved to /var/cache/conftool/dbconfig/20220331-030313-ladsgroup.json [03:03:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [03:03:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [03:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[03:03:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:03:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23963 and previous config saved to /var/cache/conftool/dbconfig/20220331-030321-ladsgroup.json [03:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23964 and previous config saved to /var/cache/conftool/dbconfig/20220331-030523-ladsgroup.json [03:05:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [03:05:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [03:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23965 and previous config saved to /var/cache/conftool/dbconfig/20220331-030531-ladsgroup.json [03:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1147', diff saved to https://phabricator.wikimedia.org/P23966 and previous config saved to /var/cache/conftool/dbconfig/20220331-031651-marostegui.json [03:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23967 and previous config saved to /var/cache/conftool/dbconfig/20220331-032401-ladsgroup.json [03:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:31:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P23968 and previous config saved to /var/cache/conftool/dbconfig/20220331-033156-marostegui.json [03:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23969 and previous config saved to /var/cache/conftool/dbconfig/20220331-033906-ladsgroup.json [03:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:54] (PS1) Func: Use variants fallback to define logos for zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/775423 (https://phabricator.wikimedia.org/T273578) [03:45:32] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:47:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147
(T298557)', diff saved to https://phabricator.wikimedia.org/P23970 and previous config saved to /var/cache/conftool/dbconfig/20220331-034701-marostegui.json [03:47:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [03:47:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [03:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:07] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [03:47:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298557)', diff saved to https://phabricator.wikimedia.org/P23971 and previous config saved to /var/cache/conftool/dbconfig/20220331-034709-marostegui.json [03:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23972 and previous config saved to /var/cache/conftool/dbconfig/20220331-035411-ladsgroup.json [03:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23973 and previous config saved to /var/cache/conftool/dbconfig/20220331-040336-ladsgroup.json [04:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, 
user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23974 and previous config saved to /var/cache/conftool/dbconfig/20220331-040916-ladsgroup.json [04:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:09:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [04:09:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [04:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23975 and previous config saved to /var/cache/conftool/dbconfig/20220331-040940-ladsgroup.json [04:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:17] (03PS3) 10DannyS712: phpcs: enable rules that are already passing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775005 (https://phabricator.wikimedia.org/T171115) [04:15:20] (03PS3) 10DannyS712: phpcs: rename test files to match class names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775426 (https://phabricator.wikimedia.org/T171115) [04:15:23] (03PS3) 10DannyS712: phpcs: enable and fix PropertyDocumentation.MissingVar 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/775427 (https://phabricator.wikimedia.org/T171115) [04:18:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23976 and previous config saved to /var/cache/conftool/dbconfig/20220331-041841-ladsgroup.json [04:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23977 and previous config saved to /var/cache/conftool/dbconfig/20220331-043346-ladsgroup.json [04:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 22 hosts with reason: Primary switchover s5 T303798 [04:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:43] T303798: Switchover s5 master db1130 -> db1100 - https://phabricator.wikimedia.org/T303798 [04:38:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 22 hosts with reason: Primary switchover s5 T303798 [04:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1100 with weight 0 T303798', diff saved to https://phabricator.wikimedia.org/P23978 and previous config saved to /var/cache/conftool/dbconfig/20220331-043906-marostegui.json [04:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23979 and previous config saved to /var/cache/conftool/dbconfig/20220331-044851-ladsgroup.json [04:48:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1129.eqiad.wmnet with reason: Maintenance [04:48:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [04:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:48:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23980 and previous config saved to /var/cache/conftool/dbconfig/20220331-044859-ladsgroup.json [04:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:29] (03PS2) 10Marostegui: wmnet: Update s5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/775196 (https://phabricator.wikimedia.org/T303798) [04:53:36] (03PS2) 10Marostegui: mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/775195 (https://phabricator.wikimedia.org/T303798) [05:02:46] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 137 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:04:47] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/775195 (https://phabricator.wikimedia.org/T303798) (owner: 10Marostegui) [05:04:59] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt 
(W)50 gt 15 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:19:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298557)', diff saved to https://phabricator.wikimedia.org/P23981 and previous config saved to /var/cache/conftool/dbconfig/20220331-051954-marostegui.json [05:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:01] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [05:23:42] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7581 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:26:42] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2631 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [05:28:04] 👋 looking [05:28:14] rzl I am checking too [05:28:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:29] oh, good morning! 
[05:29:46] looks like about 1k rps extra traffic at the api servers, I think s3 writes [05:30:13] we are getting too many connections in s7 as well [05:30:48] s3 is quite crazy https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s3&var-role=All&refresh=1m&viewPanel=7 [05:31:16] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 154 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:31:24] might be another hot template change but I'm not sure yet [05:31:38] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:31:46] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 523 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:35:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P23983 and previous config saved to /var/cache/conftool/dbconfig/20220331-053459-marostegui.json [05:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:02] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6471 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [05:40:02] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 375 gt 100 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:41:24] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:42:06] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:42:36] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:45:45] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:49:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23984 and previous config saved to /var/cache/conftool/dbconfig/20220331-054913-ladsgroup.json [05:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [05:50:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P23985 and previous config saved to /var/cache/conftool/dbconfig/20220331-055004-marostegui.json [05:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:30] PROBLEM - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100% [05:55:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:00:04] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T0600). Please do the needful. [06:00:06] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:00:19] !log Starting s5 eqiad failover from db1130 to db1100 - T303798 [06:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:25] T303798: Switchover s5 master db1130 -> db1100 - https://phabricator.wikimedia.org/T303798 [06:00:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T303798', diff saved to https://phabricator.wikimedia.org/P23986 and previous config saved to /var/cache/conftool/dbconfig/20220331-060042-root.json [06:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T303798', diff saved to https://phabricator.wikimedia.org/P23987 and previous config saved to /var/cache/conftool/dbconfig/20220331-060122-root.json [06:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:32] all done [06:04:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23988 and previous config saved to /var/cache/conftool/dbconfig/20220331-060418-ladsgroup.json [06:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298557)', diff saved to https://phabricator.wikimedia.org/P23989 and previous config saved to /var/cache/conftool/dbconfig/20220331-060509-marostegui.json [06:05:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [06:05:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [06:05:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:15] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:05:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298557)', diff saved to https://phabricator.wikimedia.org/P23990 and previous config saved to /var/cache/conftool/dbconfig/20220331-060517-marostegui.json [06:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/775196 (https://phabricator.wikimedia.org/T303798) (owner: 10Marostegui) [06:07:32] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 T303798', diff saved to https://phabricator.wikimedia.org/P23991 and previous config saved to /var/cache/conftool/dbconfig/20220331-060820-root.json [06:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:26] T303798: Switchover s5 master db1130 -> db1100 - https://phabricator.wikimedia.org/T303798 [06:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:12:23] !log dbmaint s5@eqiad T300381 [06:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:32] T300381: Make page_props.pp_page unsigned on wmf wikis - 
https://phabricator.wikimedia.org/T300381 [06:15:42] (03PS1) 10Marostegui: db1130: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/775722 (https://phabricator.wikimedia.org/T300473) [06:19:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23992 and previous config saved to /var/cache/conftool/dbconfig/20220331-061923-ladsgroup.json [06:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:27] (03CR) 10Marostegui: [C: 03+2] db1130: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/775722 (https://phabricator.wikimedia.org/T300473) (owner: 10Marostegui) [06:22:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: update comment in dynamic actions [puppet] - 10https://gerrit.wikimedia.org/r/775315 (owner: 10Giuseppe Lavagetto) [06:29:58] (03PS2) 10Giuseppe Lavagetto: requestctl::client: install preview files for actions [puppet] - 10https://gerrit.wikimedia.org/r/775316 [06:30:58] ACKNOWLEDGEMENT - SSH on ms-be1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Luca Toscano T299462 https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:30:58] ACKNOWLEDGEMENT - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100% Luca Toscano T299462 [06:31:07] topranks: --^ [06:31:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl::client: install preview files for actions [puppet] - 10https://gerrit.wikimedia.org/r/775316 (owner: 10Giuseppe Lavagetto) [06:34:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23993 and previous config saved to /var/cache/conftool/dbconfig/20220331-063429-ladsgroup.json [06:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, 
user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:38:41] (03PS1) 10Giuseppe Lavagetto: profile::requestctl_client: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/775783 [06:38:56] (03PS2) 10Giuseppe Lavagetto: profile::requestctl_client: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/775783 [06:40:26] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] profile::requestctl_client: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/775783 (owner: 10Giuseppe Lavagetto) [06:41:13] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:46] (03PS1) 10Sergio Gimeno: Post-edit dialog: check for presence of preferences.topicFilters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775784 (https://phabricator.wikimedia.org/T305057) [06:46:41] (03PS1) 10Giuseppe Lavagetto: requestctl_client: remove prefix, already in confd, fix path [puppet] - 10https://gerrit.wikimedia.org/r/775785 [06:47:05] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] requestctl_client: remove prefix, already in confd, fix path [puppet] - 10https://gerrit.wikimedia.org/r/775785 (owner: 10Giuseppe Lavagetto) [06:48:58] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [06:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:04] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1002.eqiad.wmnet with OS bullseye [06:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:32] (03CR) 10Elukey: "One last thing and then we can merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [06:56:01] PROBLEM - Confd template for /var/lib/requestctl/tests/cache-actions.inc.vcl on puppetmaster2001 is CRITICAL: File not found: /var/lib/requestctl/tests/cache-actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [06:57:38] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for prometheus-blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/775288 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [06:57:57] jouncebot: now [06:57:57] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [06:58:03] ... [06:58:06] jouncebot: next [06:58:07] In 0 hour(s) and 1 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T0700) [06:58:48] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/775281 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [06:59:13] I will run the train after the backport window [06:59:25] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Routinator [puppet] - 10https://gerrit.wikimedia.org/r/775302 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [06:59:28] hashar: There are two changes from the GrowthExperiments extension we want to backport in the next window. Should I squash them? Is squashing always a good practice for backporting changes? (when possible) [06:59:45] the opposite [06:59:54] we just cherry pick changes ;] [07:00:05] Amir1, apergos, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T0700). 
[07:00:05] duesen and Sergi0: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:16] hello! we have two trainees today if we are lucky [07:00:43] one of them will be deploying today (duesen) as soon as he gets his tech issues sorted out [07:00:50] sergi0: Gerrit kind of forces people to write the smallest patches possible, or at least writing small patches makes it easier to use Gerrit :] [07:00:56] but we can CR+2 both patches [07:01:06] they will more or less merge at the same time [07:01:23] hashar: alright, got it. I think I've heard squashing was done in the past maybe for time constraints [07:01:29] yes we prefer each patch go in separately so that e.g. order of files deployed is not an issue [07:01:41] then on the deployment server we can deploy them one by one [07:02:00] theoretically we could CR+2 all patches ahead of the start of the window [07:02:14] and deploy each of them individually since we manually git fetch on the deploy server [07:02:42] sergi0, are you self-deploying or do you need an assist? [07:02:46] (03PS2) 10Muehlenhoff: Move Prometheus Apache setup to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/775296 [07:03:05] I need assistance [07:03:07] ok [07:03:23] now I want one more deployer here so that I am not both doing the training and the deployment at the same time [07:03:33] Amir1 or taavi are you around? [07:04:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/775296 (owner: 10Muehlenhoff) [07:04:52] elukey: thanks, and sry for the noise [07:05:13] np! I cc-ed you just as confirmation that I wasn't still asleep when acking :) [07:05:15] apergos: can't help immediately sorry (gotta have my shower, brush teeth etc). 
But I guess I will be here in 20 minutes or so [07:05:29] ok, let's see when Daniel gets here for his patch [07:05:30] you might want to CR+2 the patch right now to trigger CI [07:05:32] yep [07:06:01] (03Abandoned) 10Sergio Gimeno: Post-edit dialog: check for presence of preferences.topicFilters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775784 (https://phabricator.wikimedia.org/T305057) (owner: 10Sergio Gimeno) [07:06:56] sergi0: don't you want to deploy that change ^ ? [07:07:16] (03CR) 10Muehlenhoff: Move Prometheus Apache setup to separate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775296 (owner: 10Muehlenhoff) [07:07:45] (03PS5) 10Muehlenhoff: Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) [07:07:51] I abandoned the squashed change based on our conversations. The changes to backport are the ones stated in the window: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/775370, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/775371 [07:08:00] ok great [07:08:15] o/ [07:08:19] hello! [07:08:31] (03CR) 10Muehlenhoff: [C: 03+2] Stop using profile::base::linux419 on Hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/774423 (owner: 10Muehlenhoff) [07:08:49] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:55] OH [07:10:26] your patch is first in the window and we are talking Daniel through it [07:11:35] RECOVERY - Host ms-be1069 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [07:13:02] (03CR) 10ArielGlenn: [C: 03+2] Set MW_USE_CONFIG_SCHEMA constant if file exists. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460) (owner: 10Daniel Kinzler) [07:13:42] (03Merged) 10jenkins-bot: Set MW_USE_CONFIG_SCHEMA constant if file exists. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460) (owner: 10Daniel Kinzler) [07:16:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:17:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:50] !log updating libapache2-mod-auth-cas on buster hosts [07:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:25] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:19] apergos: my patch is on mwdebug1002 now [07:22:24] go ahead and test then! 
[07:23:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:23:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:33] apergos: done testing on mwdebug, will deploy now [07:27:37] awesome! [07:29:49] (03PS1) 10Giuseppe Lavagetto: profile::conftool::requestctl_client: more fixes [puppet] - 10https://gerrit.wikimedia.org/r/775787 [07:30:29] !log daniel@deploy1002 Synchronized multiversion/defines.php: Config: [[gerrit:772937|Set MW_USE_CONFIG_SCHEMA constant if file exists. (T304460)]] (duration: 00m 52s) [07:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:36] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [07:31:02] just waiting a couple minutes and watching the dashboards before starting on your patches, sergi0 [07:31:30] apergos: great, thank you [07:33:34] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34634/console" [puppet] - 10https://gerrit.wikimedia.org/r/775787 (owner: 10Giuseppe Lavagetto) [07:34:02] sergi0: should we do your patches in the order listed in the window? [07:34:21] apergos: the order is not relevant. As you wish. [07:34:25] ok! 
[07:35:43] (03CR) 10Daniel Kinzler: [C: 03+2] Post-edit dialog: check for presence of preferences.topicFilters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775370 (https://phabricator.wikimedia.org/T305057) (owner: 10Sergio Gimeno) [07:35:57] ok the first patch will go first [07:36:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] profile::conftool::requestctl_client: more fixes [puppet] - 10https://gerrit.wikimedia.org/r/775787 (owner: 10Giuseppe Lavagetto) [07:37:29] apergos: alright [07:38:56] sergi0: once we have this on mwdebug1002, can you test it in your browser? [07:40:00] duesen: yes, I just need a couple of minutes to do a browser test [07:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298557)', diff saved to https://phabricator.wikimedia.org/P23994 and previous config saved to /var/cache/conftool/dbconfig/20220331-074010-marostegui.json [07:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:16] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:41:39] !log depool cp3056 for reimage - T290005 [07:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:46] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:44:10] (03PS2) 10MMandere: site: Reimage cp3056 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775321 (https://phabricator.wikimedia.org/T290005) [07:45:08] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3056 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775321 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:45:09] sergi0: still waiting for the first patch to merge. 
I'll let you know when it's up [07:45:45] duesen: perfect, ty [07:46:11] hashar: we might run over a little into the train window, apologies in advance (maybe 15 mins?) [07:47:11] I can also move the second patch to a later window if that helps [07:49:34] let's see what hashar says [07:49:50] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:50:58] PROBLEM - Confd template for /var/lib/requestctl/tests/cache-actions.inc.vcl on puppetmaster1001 is CRITICAL: File not found: /var/lib/requestctl/tests/cache-actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:53:49] back [07:53:58] apergos: no worries take your time [07:54:03] ok, thanks! [07:54:16] jnuche: train window is shifted a bit cause the backport window is taking a little bit more time :] [07:54:45] hashar: 👍 [07:55:15] I will deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OATHAuth/+/774996/ as well [07:55:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P23995 and previous config saved to /var/cache/conftool/dbconfig/20220331-075515-marostegui.json [07:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:27] to restore some 2FA related workflow on wikitech [07:56:55] still waiting on zuul/jenkins... [07:57:42] those test suites definitely need some care [07:58:46] apergos: I'm a bit short on time myself. I'm gonna move the second patch to a later window if you don't mind. [07:58:51] that's fine [07:59:20] sorry things got off to a bit of a slow start [08:00:05] hashar and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T0800). 
[08:00:35] (03Merged) 10jenkins-bot: Post-edit dialog: check for presence of preferences.topicFilters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775370 (https://phabricator.wikimedia.org/T305057) (owner: 10Sergio Gimeno) [08:00:40] we'll be done in just a few minutes, still waiting on zuul/jenkins which has been claiming 0 minutes for like 10 minutes now :-D [08:01:28] got merged [08:01:37] yep [08:01:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:02:13] one of the primary reasons for the slowness is that anything depending on Wikibase ends up running every single one of their tests [08:03:09] hashar: a bunch of extensions seem to be modified when we look at git status in the wmf.5 branch [08:03:31] can we ignore this? [08:03:35] hmm [08:03:45] maybe cause of security patches? [08:03:51] perhaps [08:04:20] oh it is [08:04:28] git diff --submodule=log [08:04:48] so yeah it is fine (I have checked) [08:05:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:10] so hashar, what do we need to do about the git fetch/log/rebase commands? anything special?
[08:06:41] or also the git submodule update we would do after [08:07:10] git fetch mediawiki/core [08:07:14] git log HEAD..HEAD@{u} [08:07:18] git rebase [08:07:28] the same as usual [08:07:28] ok [08:07:29] then for an extension update the submodule [08:07:33] yep [08:08:11] well pretty much whatever is listed by https://deploy-commands.toolforge.org/bacc/775370 ;] [08:08:31] yeah just that the security patches are a new wrinkle [08:08:50] I think my team has work going on to implement that in scap so one would do something like `scap backport 775370` and everything will happen magically [08:09:13] for the security patches, we have the git submodules configured to rebase automatically [08:09:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:09:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:23] so unless there is a conflict, the security patches get reapplied [08:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:00] good [08:10:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:08] yes it looks like that's working out fine [08:10:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P23996 and previous config saved to /var/cache/conftool/dbconfig/20220331-081020-marostegui.json [08:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:53] sergi0: your patch is on mwdebug1002 now. took a while to merge... [08:11:07] ok, testing now [08:11:14] please give it a quick test and let us know if it works as expected [08:15:55] duesen: tested, no issues. We're good! [08:16:56] excellent! 
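The backport steps discussed above (fetch, list the incoming commits, rebase, then update the extension submodule) can be sketched as a small shell helper. This is an illustrative sketch only, not the actual deployment tooling: the names `run`, `backport_sync` and `DRY_RUN` are hypothetical, and it assumes the current branch has an upstream configured so that `HEAD@{u}` and a bare `git rebase` resolve.

```shell
#!/bin/sh
# Hypothetical helper, not part of scap or any Wikimedia tooling.

# Print each command before running it; with DRY_RUN=1, only print it.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Sync the current checkout with its upstream branch:
#   1. fetch new commits,
#   2. list commits that are upstream but not yet in the local branch,
#   3. rebase local commits onto the upstream branch,
#   4. update submodules (arguments are passed on as submodule paths).
# The chain stops at the first step that fails.
backport_sync() {
    run git fetch &&
    run git log 'HEAD..HEAD@{u}' &&
    run git rebase &&
    run git submodule update "$@"
}
```

For example, `DRY_RUN=1 backport_sync extensions/GrowthExperiments` only prints the four git commands; the submodule path here is an assumption based on the extension named in the log.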
[08:17:41] Thank you duesen and apergos. Need to run now. [08:18:59] ok! it's going out to the cluster now [08:19:30] !log daniel@deploy1002 Synchronized php-1.39.0-wmf.5/extensions/GrowthExperiments/modules/ext.growthExperiments.PostEdit/index.js: Backport: [[gerrit:775370|Post-edit dialog: check for presence of preferences.topicFilters (T305057)]] (duration: 00m 53s) [08:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:36] T305057: Post-edit dialog is not shown for subsequent edits when topic hasn't been selected - https://phabricator.wikimedia.org/T305057 [08:19:49] (03CR) 10Hashar: [C: 03+2] Revert "OATHUserRepository: Stop handling legacy single-key" [extensions/OATHAuth] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/774996 (https://phabricator.wikimedia.org/T305029) (owner: 10Zabe) [08:19:56] +2ed this one and I will deploy [08:19:59] then do the train [08:20:14] sergi0: your patch has been pushed out to all hosts now [08:21:39] sounds great hashar, you can even close the window after that patch goes around [08:21:46] duesen: perfect, ty!
[08:22:00] yup [08:23:33] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:44] jnuche: I am in the good meet :) [08:24:55] err google meet [08:25:17] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3056.esams.wmnet with OS buster [08:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298557)', diff saved to https://phabricator.wikimedia.org/P23997 and previous config saved to /var/cache/conftool/dbconfig/20220331-082525-marostegui.json [08:25:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:25:27] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster [08:25:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:31] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:55] going to do that backport and immediately after run the train [08:26:00] (03Merged) 10jenkins-bot: Revert "OATHUserRepository: Stop handling legacy single-key" [extensions/OATHAuth] (wmf/1.39.0-wmf.5) - 
10https://gerrit.wikimedia.org/r/774996 (https://phabricator.wikimedia.org/T305029) (owner: 10Zabe) [08:26:07] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [08:27:02] hashar: I'll enable experimental config loading on mwdebug1002 and mw1415 soon (T304460). If anything explodes there, it was me. [08:27:03] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [08:27:48] duesen: can you do that after the train please [08:29:59] the train Meet is https://meet.google.com/szx-nwco-pvh [08:30:27] !log hashar@deploy1002 Synchronized php-1.39.0-wmf.5/extensions/OATHAuth/src/OATHUserRepository.php: Backport: [[gerrit:774996|Revert "OATHUserRepository: Stop handling legacy single-key" (T305029)]] (duration: 00m 51s) [08:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:32] T305029: PHP Notice: Undefined index: keys - https://phabricator.wikimedia.org/T305029 [08:32:03] hashar: ok, i'll try to find a better slot [08:32:26] well it can be done in like half an hour ;) [08:33:00] (03PS1) 10Hashar: all wikis to 1.39.0-wmf.5 refs T300204 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775798 [08:33:02] (03CR) 10Hashar: [C: 03+2] all wikis to 1.39.0-wmf.5 refs T300204 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775798 (owner: 10Hashar) [08:33:45] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.5 refs T300204 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775798 (owner: 10Hashar) [08:33:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-corp2001.wikimedia.org [08:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:12] hashar: tight schedule today, but perhaps we can do at least mwdebug in half an hour.
I'll ping you again [08:35:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:38] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.5 refs T300204 [08:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:45] T300204: 1.39.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T300204 [08:35:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:35:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp2001.wikimedia.org [08:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:44] duesen: train looks fine :] [08:38:46] (03CR) 10Ayounsi: [C: 03+2] analytics1-a-eqiad: replace firewall filter with strict uRPF [homer/public] - 10https://gerrit.wikimedia.org/r/775280 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [08:39:05] (03CR) 10Btullis: [C: 03+1] "LGTM." 
[homer/public] - 10https://gerrit.wikimedia.org/r/775280 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [08:39:15] (03PS9) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [08:39:22] (03Merged) 10jenkins-bot: analytics1-a-eqiad: replace firewall filter with strict uRPF [homer/public] - 10https://gerrit.wikimedia.org/r/775280 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [08:39:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet [08:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:00] !log analytics1-a-eqiad: replace firewall filter with strict uRPF - T298087 [08:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:57] (03PS10) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [08:40:59] (03CR) 10Klausman: hiera: Add ML staging k8s role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [08:41:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet [08:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:21] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-worker1001.eqiad.wmnet [08:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:57] 10SRE-swift-storage: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996 (10jcrespo) [08:46:25] yup train 
looks good [08:46:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:46:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:37] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:50:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-worker1001.eqiad.wmnet [08:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:20] !log depool cp4029 for reimage - T290005 [08:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:26] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:54:12] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3056.esams.wmnet with reason: host reimage [08:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:50] 10SRE-swift-storage: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996 (10jcrespo) [08:54:55] PROBLEM - Check systemd state on gitlab-runner1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-corp1001.wikimedia.org [08:55:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [08:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:42] 10SRE-swift-storage: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996 (10jcrespo) [08:57:39] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3056.esams.wmnet with reason: host reimage [08:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:00] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1069.eqiad.wmnet with OS stretch [08:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:05] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch executed with errors: -... 
[08:58:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [08:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:20] (03PS2) 10MMandere: site: Reimage cp4029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775322 (https://phabricator.wikimedia.org/T290005) [08:58:49] (03CR) 10Subramanya Sastry: [C: 03+1] Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:58:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp1001.wikimedia.org [08:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:10] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775322 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [08:59:43] (03PS1) 10Ayounsi: analytics1-b/c/d-eqiad: replace firewall filter with strict uRPF [homer/public] - 10https://gerrit.wikimedia.org/r/775805 (https://phabricator.wikimedia.org/T298087) [09:02:06] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4029.ulsfo.wmnet with OS buster [09:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:15] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4029.ulsfo.wmnet with OS buster [09:05:00] hashar: hey, can we start poking at mwdebug now?
[09:05:34] duesen: yeah train is done :] [09:06:34] (03PS2) 10Ayounsi: analytics1-b/c/d-eqiad: replace firewall filter with strict uRPF [homer/public] - 10https://gerrit.wikimedia.org/r/775805 (https://phabricator.wikimedia.org/T298087) [09:06:47] (03CR) 10Jaime Nuche: [C: 03+1] scap: make rsync use new compress algorithm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [09:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23999 and previous config saved to /var/cache/conftool/dbconfig/20220331-090717-root.json [09:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:50] !log created /var/run/php/use-config-schema on mwdebug1002 to enable config schema loading (T304460) [09:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:56] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [09:08:57] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on ms-be1069.eqiad.wmnet with reason: Puppet errors during reimage [09:08:58] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be1069.eqiad.wmnet with reason: Puppet errors during reimage [09:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:08] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on ms-be1069.eqiad.wmnet with reason: Puppet errors during reimage [09:09:09] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on ms-be1069.eqiad.wmnet with reason: Puppet errors during reimage [09:09:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:45] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:05] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) @MatthewVernon in an attempt to gather debug information from "the instant the problem first occured" I attempted a reimage of ms-be1069 this evenin... [09:13:21] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:21] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:16:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:16:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T297189)', diff saved to https://phabricator.wikimedia.org/P24000 and previous config saved to /var/cache/conftool/dbconfig/20220331-091626-marostegui.json [09:16:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:34] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [09:16:48] !log created /var/run/php/use-config-schema on canary mw1415 to enable config schema loading (T304460) [09:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:53] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [09:17:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet [09:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:35] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:18:13] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-worker1002.eqiad.wmnet [09:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:24] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4029.ulsfo.wmnet with reason: host reimage [09:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet [09:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:18] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4029.ulsfo.wmnet with reason: host reimage [09:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:42] (03PS1) 10Jelto: gitlab_runner: overwrite default service unit file [puppet] - 10https://gerrit.wikimedia.org/r/775808 
(https://phabricator.wikimedia.org/T295481) [09:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P24001 and previous config saved to /var/cache/conftool/dbconfig/20220331-092221-root.json [09:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-worker1002.eqiad.wmnet [09:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-worker1003.eqiad.wmnet [09:23:38] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:35] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34635/console" [puppet] - 10https://gerrit.wikimedia.org/r/775808 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:25:01] !log removed /var/run/php/use-config-schema from mwdebug1002 to disable config schema loading (T304460) [09:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:06] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [09:26:58] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3056.esams.wmnet with OS buster [09:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:06] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster com... [09:29:49] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: overwrite default service unit file [puppet] - 10https://gerrit.wikimedia.org/r/775808 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:29:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-worker1003.eqiad.wmnet [09:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:37] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Nemo_bis) [09:34:11] (03PS1) 10Jelto: Revert "gitlab_runner: overwrite default service unit file" [puppet] - 10https://gerrit.wikimedia.org/r/775431 [09:35:10] (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: overwrite default service unit file" [puppet] - 10https://gerrit.wikimedia.org/r/775431 (owner: 10Jelto) [09:37:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P24002 and previous config saved to /var/cache/conftool/dbconfig/20220331-093725-root.json [09:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:01] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4029.ulsfo.wmnet with OS buster [09:43:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T297189)', diff saved to https://phabricator.wikimedia.org/P24003 and previous config saved to /var/cache/conftool/dbconfig/20220331-094304-marostegui.json [09:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:08] 10SRE, 10Traffic, 
10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4029.ulsfo.wmnet with OS buster com... [09:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:11] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [09:44:45] (03PS1) 10JMeybohm: Remove BGP config for kubernetes[12]00[1-4] [homer/public] - 10https://gerrit.wikimedia.org/r/775814 (https://phabricator.wikimedia.org/T303044) [09:45:45] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:47:54] (03PS1) 10Jelto: gitlab_runner: override User of default service unit file [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) [09:48:42] (03CR) 10Elukey: [C: 03+1] Remove BGP config for kubernetes[12]00[1-4] [homer/public] - 10https://gerrit.wikimedia.org/r/775814 (https://phabricator.wikimedia.org/T303044) (owner: 10JMeybohm) [09:49:40] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:49:48] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34636/console" [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:50:19] (03CR) 10JMeybohm: [C: 03+2] Remove BGP config for kubernetes[12]00[1-4] [homer/public] - 10https://gerrit.wikimedia.org/r/775814 (https://phabricator.wikimedia.org/T303044) (owner: 10JMeybohm) [09:51:02] (03CR) 
10Ayounsi: [C: 03+2] labs-in filter: remove PXE term [homer/public] - 10https://gerrit.wikimedia.org/r/769657 (owner: 10Ayounsi) [09:51:28] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [homer/public] - 10https://gerrit.wikimedia.org/r/775805 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [09:51:40] (03Merged) 10jenkins-bot: labs-in filter: remove PXE term [homer/public] - 10https://gerrit.wikimedia.org/r/769657 (owner: 10Ayounsi) [09:51:51] (03CR) 10Elukey: [C: 03+2] Apply the istio sidecar/mesh settings to the ml-serve configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/775343 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:52:12] (03CR) 10Elukey: [C: 03+2] knative-serving: refactor support for egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/775344 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:52:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P24004 and previous config saved to /var/cache/conftool/dbconfig/20220331-095228-root.json [09:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] gitlab_runner: override User of default service unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:52:49] (03CR) 10Elukey: [C: 03+2] Move ml-serve pod configs to Istio proxy sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/775345 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:53:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:53:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:53:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:53:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298557)', diff saved to https://phabricator.wikimedia.org/P24005 and previous config saved to /var/cache/conftool/dbconfig/20220331-095319-marostegui.json [09:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:21] 10SRE, 10Product-Infrastructure-Team-Backlog, 10Traffic-Icebox, 10WMDE-GeoInfo-FocusArea, 10Maps (Kartotherian): Geoshapes service is not sending 'access-control-allow-origin' header to some requests - https://phabricator.wikimedia.org/T241644 (10TheDJ) Tagging WMDE, as this looks like something that the... 
[09:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:27] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:54:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host miscweb2002.codfw.wmnet [09:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:56] (03PS2) 10Jelto: gitlab_runner: override User of default service unit file [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) [09:57:36] (03CR) 10Jelto: gitlab_runner: override User of default service unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:58:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P24006 and previous config saved to /var/cache/conftool/dbconfig/20220331-095809-marostegui.json [09:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:30] (03PS11) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [09:58:34] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34637/console" [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:59:26] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. 
Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T1000). [10:00:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:35] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:00:45] !log pool cp4029 with HAProxy as TLS termination layer - T290005 [10:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:50] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:01:35] (03PS12) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [10:01:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb2002.codfw.wmnet [10:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:33] (03PS1) 10Elukey: Fix istio injection namespace labels values for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/775816 [10:03:07] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10ayounsi) a:05ayounsi→03Papaul [10:03:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host miscweb1002.eqiad.wmnet [10:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:39] (03PS13) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [10:04:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] gitlab_runner: override User of default service unit file [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:04:48] PROBLEM - Some MediaWiki 
servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:05:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb1002.eqiad.wmnet [10:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:23] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: override User of default service unit file [puppet] - 10https://gerrit.wikimedia.org/r/775815 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:06:28] (03PS14) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [10:07:10] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:28] (03CR) 10Elukey: [C: 03+2] Fix istio injection namespace labels values for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/775816 (owner: 10Elukey) [10:08:09] (03PS15) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [10:08:53] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34642/console" [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [10:10:33] PROBLEM - MariaDB Replica Lag: s7 #page on db1181 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21470.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:36] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34643/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [10:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:10:46] * volans here [10:10:53] here [10:10:58] I think it is from a schema change [10:11:04] (03PS16) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [10:11:06] I think Amir1 was doing s7 [10:11:08] let me check [10:11:18] (03CR) 10Klausman: [C: 03+2] hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [10:11:23] (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [10:11:25] <_joe_> marostegui: do we need to depool that server? [10:11:28] the host is depooled [10:11:30] so no user impact [10:11:32] around if needed [10:11:33] I am going to silence it [10:11:35] ack [10:11:36] <_joe_> ok [10:11:38] (03CR) 10Elukey: [C: 03+1] "LGTM! Before merging I'd suggest to change the commit msg to reflect more what you are doing (it is not anymore limited to hiera)." [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [10:11:42] * volans acking on VO [10:11:45] Amir1: I just started replication there, but please check why it paged [10:11:51] <_joe_> yeah maybe we need to silence that alert while we're depooled [10:11:59] what is the issue? 
[10:12:09] _joe_: the automatic schema change does it, but maybe it took longer than expected or it failed [10:12:16] anyways, no impact [10:12:23] ah, sorry, I had the scroll bad [10:12:34] just jumped out of the bed [10:12:36] okay [10:12:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:12:41] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:44] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P24007 and previous config saved to /var/cache/conftool/dbconfig/20220331-101314-marostegui.json [10:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:44] resolved the VO page [10:13:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:13] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3056.esams.wmnet with OS buster [10:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:22] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster [10:14:42] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:15:33] marostegui: aah found it: [10:15:34] ``` [10:15:38] https://www.irccloud.com/pastebin/KbmQ2fwq/ [10:15:42] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:58] ^ gitlab alert is expected [10:16:23] and broke it until the alert expired [10:17:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2002.codfw.wmnet [10:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:14] Amir1: ok, I have started replication anyways [10:19:02] now we need to repool + make sure this error doesn't break auto schema in the future [10:19:15] I go eat something first [10:19:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2002.codfw.wmnet [10:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:27] (03PS1) 10Jelto: gitlab_runner: override 
ExecStart of default service unit file [puppet] - 10https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) [10:20:46] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:53] (03CR) 10Jelto: "This slipped through in my previous change Ia1f92fbce33b1e93007f6547080f33348b74bc8e." [puppet] - 10https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:26:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:26:33] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:26:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:58] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10MatthewVernon) Thanks. The reason puppet is failing on all these nodes is that they don't appear (to the OS kernel) to have their usual set of spinning disks... 
[10:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T297189)', diff saved to https://phabricator.wikimedia.org/P24009 and previous config saved to /var/cache/conftool/dbconfig/20220331-102819-marostegui.json [10:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:26] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [10:28:38] RECOVERY - MariaDB Replica Lag: s7 #page on db1181 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:30:31] (03PS1) 10Muehlenhoff: Only apply automated restarts for imagecatalog on the active deployment server [puppet] - 10https://gerrit.wikimedia.org/r/775822 (https://phabricator.wikimedia.org/T305135) [10:30:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet [10:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet [10:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/775822 (https://phabricator.wikimedia.org/T305135) (owner: 10Muehlenhoff) [10:34:51] (03PS1) 10Klausman: labs: Add dummy token for istio-cni on ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/775823 [10:37:03] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 86 connections established with conf2004.codfw.wmnet:4001 (min=87) https://wikitech.wikimedia.org/wiki/PyBal [10:37:33] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.72:6443]) https://wikitech.wikimedia.org/wiki/PyBal [10:38:58] (KubernetesRsyslogDown) firing: 
rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:39:23] (03CR) 10Muehlenhoff: "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:41:11] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3056.esams.wmnet with reason: host reimage [10:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:44:41] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3056.esams.wmnet with reason: host reimage [10:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:37] (03PS1) 10Elukey: Update helmfie_istio-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/775824 (https://phabricator.wikimedia.org/T297612) [10:45:39] (03PS1) 10Elukey: Enable istio proxy injection in ml-serve's kserve ns and update settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/775825 (https://phabricator.wikimedia.org/T297612) [10:45:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:48:40] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:48:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:49:55] (03CR) 10JMeybohm: [C: 03+1] Update helmfie_istio-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/775824 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [10:50:32] (03CR) 10Elukey: [C: 03+2] Update helmfie_istio-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/775824 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [10:50:39] (03CR) 10Elukey: [C: 03+2] Enable istio proxy injection in ml-serve's kserve ns and update settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/775825 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [10:51:20] RECOVERY - Check systemd state on kafka-test1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:06] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:53:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:53:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[10:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:30] RECOVERY - Check systemd state on kafka-test1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:37] (03CR) 10Cathal Mooney: [C: 03+2] analytics1-b/c/d-eqiad: replace firewall filter with strict uRPF [homer/public] - 10https://gerrit.wikimedia.org/r/775805 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [10:53:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:54:20] (03Merged) 10jenkins-bot: analytics1-b/c/d-eqiad: replace firewall filter with strict uRPF [homer/public] - 10https://gerrit.wikimedia.org/r/775805 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [10:55:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:55:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:04:09] (03CR) 10JMeybohm: Add helm charts and a helmfile configuration for datahub (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [11:04:38] RECOVERY - Check systemd state on gitlab-runner1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:24] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/ml-staging-ctrl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:50] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.72:6443]) https://wikitech.wikimedia.org/wiki/PyBal [11:06:58] (03CR) 10JMeybohm: [C: 03+2] Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - 10https://gerrit.wikimedia.org/r/774528 (owner: 10JMeybohm) [11:08:18] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3056.esams.wmnet with OS buster [11:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:26] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3056.esams.wmnet with OS buster com... 
[11:08:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "No need to wait for reviews in this kind of trivial change." [puppet] - 10https://gerrit.wikimedia.org/r/774854 (https://phabricator.wikimedia.org/T304916) (owner: 10David Caro) [11:08:53] 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal, 10Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo) A summary of my conclusions after the feedback were docummented at: https://docs.google.com/document/... [11:12:06] (03Merged) 10jenkins-bot: Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - 10https://gerrit.wikimedia.org/r/774528 (owner: 10JMeybohm) [11:13:45] (03PS1) 10KartikMistry: Enable Content and Section Translation for Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775829 (https://phabricator.wikimedia.org/T296475) [11:14:08] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:36] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 66 connections established with conf2004.codfw.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [11:16:24] !log pool cp3056 with HAProxy as TLS termination layer - T290005 [11:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:18:08] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/ml-staging-ctrl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:19:20] !log installing libpcap security updates [11:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:19:29] (03PS48) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [11:19:47] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [11:20:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2002.codfw.wmnet [11:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2002.codfw.wmnet [11:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:14] PROBLEM - Check systemd state on an-worker1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:43] --^ looking [11:35:29] (03PS1) 10Muehlenhoff: Disable automated restarts for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/775840 (https://phabricator.wikimedia.org/T135991) [11:37:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/775840 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:37:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298557)', diff saved to https://phabricator.wikimedia.org/P24010 and previous config saved to /var/cache/conftool/dbconfig/20220331-113730-marostegui.json [11:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:36] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:39:40] !log depool cp3057 for reimage - T290005 [11:39:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:46] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:41:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet [11:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet [11:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:10] moritzm: Since this morning's merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/775281 it has highlighted that a number of hosts weren't running ipmiseld - Not sure why yet. [11:43:55] I've started it manually on e.g. an-db1001 and it runs OK, but can't immediately see why it shouldn't have started automatically. [11:46:22] (03CR) 10Muehlenhoff: [C: 03+2] Disable automated restarts for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/775840 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:46:29] (03PS2) 10Muehlenhoff: Disable automated restarts for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/775840 (https://phabricator.wikimedia.org/T135991) [11:47:46] btullis: yeah, I've also started to look into this, now I'm going to disable the auto restart to avoid further alert spam, but the underlying issue is really that ipmiseld wasn't running, I'll open a task for this [11:48:10] going to clean out Icinga when 775840 is merged [11:49:36] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:50:57] (03PS2) 10MMandere: site: Reimage cp3057 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775323 (https://phabricator.wikimedia.org/T290005) [11:51:55] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3057 as 
cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775323 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:52:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P24011 and previous config saved to /var/cache/conftool/dbconfig/20220331-115235-marostegui.json [11:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:50] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3057.esams.wmnet with OS buster [11:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:57] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3057.esams.wmnet with OS buster [11:55:18] back [11:56:04] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:56] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:30] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:48] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:48] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:48] RECOVERY - Check systemd state on wdqs2008 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:25] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [12:03:34] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:22] !log installing wireshark security updates [12:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:22] (03PS1) 10Bartosz Dziewoński: Fix error/warning boxes on signup form [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775432 (https://phabricator.wikimedia.org/T305098) [12:07:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P24012 and previous config saved to /var/cache/conftool/dbconfig/20220331-120742-marostegui.json [12:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:54] !log depool cp4023 for reimage - T290005 [12:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:00] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:09:21] (03CR) 10Tacsipacsi: "Thanks for backporting it! 
(By the way, I don’t seem to be able to vote on this, so I moved myself to CC.)" [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775432 (https://phabricator.wikimedia.org/T305098) (owner: 10Bartosz Dziewoński) [12:09:24] (03PS2) 10MMandere: site: Reimage cp4023 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775324 (https://phabricator.wikimedia.org/T290005) [12:11:20] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4023 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775324 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:12:57] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4023.ulsfo.wmnet with OS buster [12:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:06] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4023.ulsfo.wmnet with OS buster [12:16:42] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:12] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [12:22:10] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3057.esams.wmnet with reason: host reimage [12:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298557)', diff saved to https://phabricator.wikimedia.org/P24013 and previous config saved to /var/cache/conftool/dbconfig/20220331-122247-marostegui.json [12:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:22:52] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:25:31] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3057.esams.wmnet with reason: host reimage [12:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:36] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4023.ulsfo.wmnet with reason: host reimage [12:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:57] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4023.ulsfo.wmnet with reason: host reimage [12:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:08] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:39:19] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Add SpecialPageFatalTest to @group Database" [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775435 [12:39:25] (03PS1) 10Lucas Werkmeister (WMDE): Revert "GlobalUsersPager: add gu_id to GROUP BY" [extensions/CentralAuth] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775436 [12:41:38] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10ayounsi) @RobH ping? 
[12:42:22] (PS1) Bartosz Dziewoński: ChangeTags: Use localizer with correct page title to parse messages [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/775437 (https://phabricator.wikimedia.org/T302754) [12:43:37] PROBLEM - MariaDB read only db_inventory #page on db2093 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.22-MariaDB-log, Uptime 77742s, event_scheduler: True, 119.75 QPS, connection latency: 0.004384s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:43:58] argh, that's me, sorry. [12:44:04] * Emperor arrives [12:44:36] ACKNOWLEDGEMENT - MariaDB read only db_inventory #page on db2093 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.22-MariaDB-log, Uptime 77742s, event_scheduler: True, 119.75 QPS, connection latency: 0.004384s Kormat That was me. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:45:23] RECOVERY - MariaDB read only db_inventory #page on db2093 is OK: Version 10.4.22-MariaDB-log, Uptime 77848s, read_only: False, event_scheduler: True, 111.86 QPS, connection latency: 0.004380s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:47:16] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (ayounsi) Open→Resolved DoH is advertised from drmrs, I'll leave it to Traffic to decide about the anycast NS. 
[12:48:04] !log analytics1-b/c/d-eqiad: replace firewall filter with strict uRPF - T298087 [12:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:36] (PS1) Kormat: mariadb: Stop special-casing db2093 [puppet] - https://gerrit.wikimedia.org/r/775852 (https://phabricator.wikimedia.org/T301315) [12:49:55] (PS2) Kormat: mariadb: Stop special-casing db2093 [puppet] - https://gerrit.wikimedia.org/r/775852 (https://phabricator.wikimedia.org/T301315) [12:50:55] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3057.esams.wmnet with OS buster [12:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:04] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3057.esams.wmnet with OS buster com... 
[12:51:33] (CR) Kormat: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34644/console" [puppet] - https://gerrit.wikimedia.org/r/775852 (https://phabricator.wikimedia.org/T301315) (owner: Kormat) [12:52:17] (CR) Kormat: mariadb: Stop special-casing db2093 [puppet] - https://gerrit.wikimedia.org/r/775852 (https://phabricator.wikimedia.org/T301315) (owner: Kormat) [12:53:36] !log pool cp3057 with HAProxy as TLS termination layer - T290005 [12:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:42] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:53:58] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4023.ulsfo.wmnet with OS buster [12:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:06] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4023.ulsfo.wmnet with OS buster com... [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T1300). [13:00:05] Lucas_WMDE and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] o/ [13:00:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Add SpecialPageFatalTest to @group Database" [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775435 (owner: 10Lucas Werkmeister (WMDE)) [13:00:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "GlobalUsersPager: add gu_id to GROUP BY" [extensions/CentralAuth] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775436 (owner: 10Lucas Werkmeister (WMDE)) [13:00:35] hi [13:00:54] (03PS1) 10Elukey: Update istio mesh settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/775855 (https://phabricator.wikimedia.org/T297612) [13:01:03] hi! [13:01:41] (03PS2) 10Lucas Werkmeister (WMDE): Configure `mul` language code on Test Wikidata and its clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755453 (https://phabricator.wikimedia.org/T297393) [13:03:09] !log pool cp4023 with HAProxy as TLS termination layer - T290005 [13:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:15] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:03:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Configure `mul` language code on Test Wikidata and its clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755453 (https://phabricator.wikimedia.org/T297393) (owner: 10Lucas Werkmeister (WMDE)) [13:04:35] (03Merged) 10jenkins-bot: Configure `mul` language code on Test Wikidata and its clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755453 (https://phabricator.wikimedia.org/T297393) (owner: 10Lucas Werkmeister (WMDE)) [13:04:55] alright, let’s start with that [13:05:23] testing on mwdebug1001 first [13:06:50] looks good, syncing in two steps [13:08:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755453|Configure `mul` language code on Test Wikidata and its clients (T297393)]] (1/2) (duration: 
00m 51s) [13:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:15] T297393: Implement basic version of mul language code and deploy to Test Wikidata - https://phabricator.wikimedia.org/T297393 [13:09:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:755453|Configure `mul` language code on Test Wikidata and its clients (T297393)]] (2/2) (duration: 00m 50s) [13:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:55] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [13:12:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:48] (03PS1) 10MVernon: Makefile, docs: Note this is now obsolete [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) [13:14:38] (03CR) 10MVernon: "Hi," [software/swift-ring] - 10https://gerrit.wikimedia.org/r/775856 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:17:09] (03PS2) 10Func: Add logo variants for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775416 
(https://phabricator.wikimedia.org/T273578) [13:17:23] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [13:17:23] 10SRE-swift-storage, 10Patch-For-Review: Consider swift ring management automation - https://phabricator.wikimedia.org/T265117 (10MatthewVernon) 05Open→03Resolved Deployed, new process documented, and a CR in to the old tooling to document its obsolescence. [13:17:43] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10Ottomata) Oh ho, we should talk! https://wikitech.wikimedia.org/wiki/Shared_Data_Platform [13:19:03] (03Merged) 10jenkins-bot: Revert "Add SpecialPageFatalTest to @group Database" [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775435 (owner: 10Lucas Werkmeister (WMDE)) [13:19:55] ^ syncing that one [13:20:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.5/tests/phpunit/structure/SpecialPageFatalTest.php: Backport: [[gerrit:775435|Revert "Add SpecialPageFatalTest to @group Database"]] (no-op) (duration: 00m 50s) [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:08] you’re kidding me, Zuul is restarting the CentralAuth gate-and-submit from scratch?! 
[13:21:11] >.< [13:21:50] centralauth doesn't take long [13:22:10] you’re right, it’s predicting 3 minutes now [13:22:19] still annoying though [13:22:31] what confuses me is that zuul tries to merge the master patch through gate-and-submit-wmf [13:22:52] (I’ll also backport these to wmf.4, but I don’t want to upload wmf.4 cherry-picks until wmf.5 is fully merged) [13:23:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:57] the original patches aren't in wmf.4 [13:24:05] ah, okay [13:24:11] that’s good, one thing less to worry about :) [13:24:22] (03Merged) 10jenkins-bot: Revert "GlobalUsersPager: add gu_id to GROUP BY" [extensions/CentralAuth] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775436 (owner: 10Lucas Werkmeister (WMDE)) [13:24:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:35] 10SRE, 10Infrastructure-Foundations, 10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10MoritzMuehlenhoff) [13:25:06] testing on mwdebug1001 [13:25:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:15] seems fine as far as I can tell, let’s sync [13:27:50] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.5/extensions/CentralAuth/includes/Special/GlobalUsersPager.php: Backport: [[gerrit:775436|Revert "GlobalUsersPager: add gu_id to GROUP BY"]] (duration: 00m 50s) [13:27:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:20] (03PS2) 10Lucas Werkmeister (WMDE): Fix error/warning boxes on signup form [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775432 (https://phabricator.wikimedia.org/T305098) (owner: 10Bartosz Dziewoński) [13:28:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix error/warning boxes on signup form [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775432 (https://phabricator.wikimedia.org/T305098) (owner: 10Bartosz Dziewoński) [13:29:04] * Lucas_WMDE looks again at that ChangeTags backport [13:30:05] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [13:30:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:31:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:51] (03CR) 10Klausman: hiera: Add ML staging k8s role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:32:15] (03CR) 10Klausman: [C: 03+2] labs: Add dummy token for istio-cni on ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/775823 (owner: 10Klausman) [13:32:57] PROBLEM - DPKG on 
idp-test1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:34:04] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] ChangeTags: Use localizer with correct page title to parse messages [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775437 (https://phabricator.wikimedia.org/T302754) (owner: 10Bartosz Dziewoński) [13:36:04] (03PS1) 10Klausman: hiera/modules: Add config for ML staging k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) [13:36:54] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for prometheus-atlas-exporter [puppet] - 10https://gerrit.wikimedia.org/r/775861 (https://phabricator.wikimedia.org/T135991) [13:37:32] (03PS2) 10Klausman: hiera/modules: Add config for ML staging k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) [13:38:24] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34646/console" [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:41:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2001.wikimedia.org [13:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:50] (03CR) 10Elukey: hiera/modules: Add config for ML staging k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:45:53] (03PS3) 10Klausman: hiera/modules: Add config for ML staging k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) [13:46:03] (03CR) 10Klausman: hiera/modules: Add config for ML staging k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) (owner: 
10Klausman) [13:46:11] (03PS9) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [13:47:07] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34647/console" [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:47:22] (03Merged) 10jenkins-bot: Fix error/warning boxes on signup form [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775432 (https://phabricator.wikimedia.org/T305098) (owner: 10Bartosz Dziewoński) [13:47:58] MatmaRex: the error/warning boxes on signup form should be fixed on mwdebug1001, can you check? [13:49:21] yeah [13:49:29] (03CR) 10Giuseppe Lavagetto: k8s: add module (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [13:49:34] ok [13:49:56] Lucas_WMDE: looks good [13:50:00] alright, thanks! [13:50:42] syncing [13:51:01] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) On that second part, we discussed it within Infrastructure Foundation. With the webproxies (and url-downloader)... 
[13:51:14] (03CR) 10Elukey: [C: 03+1] "The nodes will need to be rebooted after applying the first puppet run to get all the "old" cgroups (the kubelet will complain as it has h" [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:51:27] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.5/resources/src/mediawiki.special.createaccount/HtmlformChecker.js: Backport: [[gerrit:775432|Fix error/warning boxes on signup form (T305098)]] (duration: 00m 50s) [13:51:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2001.wikimedia.org [13:51:30] Lucas_WMDE: the other patch can't be easily tested, the issue is intermitted and it wasn't occurring for me this morning, but there's a logstash search we can check later [13:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:33] T305098: Error and warning boxes in account creation appear incorrectly - https://phabricator.wikimedia.org/T305098 [13:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:39] alright [13:51:40] intermittent* [13:51:47] then I’ll sync that directly [13:51:53] (03CR) 10Klausman: [V: 03+1] hiera/modules: Add config for ML staging k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:52:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:10] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera/modules: Add config for ML staging k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/775860 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:52:41] (03PS2) 10Jelto: gitlab_runner: override ExecStart in service unit for non-root [puppet] - 10https://gerrit.wikimedia.org/r/775821 
(https://phabricator.wikimedia.org/T295481) [13:52:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1002.wikimedia.org [13:52:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:52:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:06] (Merged) jenkins-bot: ChangeTags: Use localizer with correct page title to parse messages [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/775437 (https://phabricator.wikimedia.org/T302754) (owner: Bartosz Dziewoński) [13:53:11] !log depool cp5009 for reimage - T290005 [13:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:17] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:53:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:32] (PS1) JMeybohm: Don't add ingress port twice to networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/775865 [13:54:11] ah, forgot to rebase that last one [13:54:16] so now there’s a merge commit. 
oh well [13:55:56] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.5/includes/changetags/ChangeTags.php: Backport: [[gerrit:775437|ChangeTags: Use localizer with correct page title to parse messages (T302754)]] (duration: 00m 51s) [13:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:02] T302754: Fatal exception of type "BadMethodCallException" on MediaWiki.org Special:RecentChanges filter widget - https://phabricator.wikimedia.org/T302754 [13:56:15] !log UTC afternoon backport+config window done [13:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:19] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34648/console" [puppet] - 10https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:58:06] (03PS1) 10Volans: cookbooks.sre: SREBatchRunnerBase improve sleeps [cookbooks] - 10https://gerrit.wikimedia.org/r/775867 [13:58:08] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: improve SAL log [cookbooks] - 10https://gerrit.wikimedia.org/r/775868 [13:58:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:42] thanks Lucas_WMDE [13:59:48] np [14:00:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1002.wikimedia.org [14:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:40] (03CR) 10JMeybohm: "While reading through all the rendered YAML I found another edge case not yet being handled properly by our scaffold templates." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:01:18] (03PS1) 10DCausse: team-search-platform: alert when flink task slots is too low [alerts] - 10https://gerrit.wikimedia.org/r/775869 (https://phabricator.wikimedia.org/T305068) [14:01:31] (03PS1) 10Elukey: Improve secrets chart's rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/775870 [14:02:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:02:48] !log installing vim security updates on buster [14:02:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:08] (03PS2) 10Elukey: Improve secrets chart's rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/775870 [14:03:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:58] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5009.eqsin.wmnet with OS buster [14:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] (03PS49) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [14:04:14] (03CR) 10Muehlenhoff: [C: 03+2] Extend edges alias to also include drmrs now that the site is live [puppet] - 10https://gerrit.wikimedia.org/r/773452 (owner: 10Muehlenhoff) [14:04:17] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster [14:04:38] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:04:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:05:52] !log mmandere@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5009.eqsin.wmnet with OS buster [14:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:00] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster exe... 
[14:06:08] (03CR) 10Ayounsi: [C: 03+1] Enable profile::auto_restarts::service for prometheus-atlas-exporter [puppet] - 10https://gerrit.wikimedia.org/r/775861 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:06:12] (03PS3) 10MMandere: site: Reimage cp5009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775325 (https://phabricator.wikimedia.org/T290005) [14:07:29] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:12] (03CR) 10MMandere: [C: 03+2] site: Reimage cp5009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775325 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:09:47] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5009.eqsin.wmnet with OS buster [14:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:57] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster [14:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:11:43] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:11:47] (03CR) 10Ayounsi: [C: 03+1] sre.cdn.roll-restart-varnish: improve SAL log (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/775868 (owner: 10Volans) [14:11:53] PROBLEM - puppet last run on cp4034 is CRITICAL: CRITICAL: Puppet 
last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:11:53] PROBLEM - puppet last run on cp4022 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:11:59] PROBLEM - puppet last run on cp6002 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:03] PROBLEM - puppet last run on cp2040 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:05] PROBLEM - puppet last run on cp4030 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:09] PROBLEM - puppet last run on cp6010 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:27] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:29] PROBLEM - puppet last run on cp1076 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:35] PROBLEM - puppet last run on cp2036 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:35] PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:35] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:35] PROBLEM - puppet last run on cp6007 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:41] PROBLEM - puppet last run on cp5014 is CRITICAL: CRITICAL: 
Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:12:54] what's up with puppet and cp hosts? [14:12:56] checking [14:13:05] PROBLEM - puppet last run on cp4036 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:07] PROBLEM - puppet last run on cp5001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:08] ^ yeah, seems like Puppet is disabled [14:13:09] PROBLEM - puppet last run on cp2028 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:12] PROBLEM - puppet last run on cp6015 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:17] PROBLEM - puppet last run on cp6009 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:17] PROBLEM - puppet last run on cp6004 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:18] mmandere: any ideas on this? [14:13:19] PROBLEM - puppet last run on cp5005 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:27] PROBLEM - puppet last run on cp4035 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:28] is puppet disabled? 
[14:13:31] PROBLEM - puppet last run on cp5015 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:31] PROBLEM - puppet last run on cp5003 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:33] I don't see any run on puppetboard [14:13:35] PROBLEM - puppet last run on cp4032 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:38] and the last ones were correct [14:13:43] PROBLEM - puppet last run on cp2042 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:45] PROBLEM - puppet last run on cp2039 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:45] PROBLEM - puppet last run on cp6013 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:47] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:48] (03CR) 10Ayounsi: [C: 03+1] cookbooks.sre: SREBatchRunnerBase improve sleeps [cookbooks] - 10https://gerrit.wikimedia.org/r/775867 (owner: 10Volans) [14:13:53] PROBLEM - puppet last run on cp3060 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:13:57] PROBLEM - puppet last run on cp3053 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:01] yep, last puppet run (475 minutes ago [14:14:01] PROBLEM - puppet last run on cp5016 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:03] volans: not sure, 
checking why that is the case [14:14:19] PROBLEM - puppet last run on cp1082 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:23] PROBLEM - puppet last run on cp5002 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:27] PROBLEM - puppet last run on cp1083 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:31] but is not explicitly disabled, weird [14:14:31] PROBLEM - puppet last run on cp1088 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:31] PROBLEM - puppet last run on cp2034 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:14:51] sukhe: yeah, _joe_ has just enabled puppet on those hosts [14:15:12] PROBLEM - puppet last run on cp3063 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:14] but was it disabled before that? 
[14:15:19] PROBLEM - puppet last run on cp6011 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:19] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:22] PROBLEM - puppet last run on cp3064 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:25] PROBLEM - puppet last run on cp6006 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:37] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:37] PROBLEM - puppet last run on cp2031 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:38] volans: yes it was, we missed enabling it earlier [14:15:45] <_joe_> yeah sorry [14:15:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:15:47] PROBLEM - puppet last run on cp1086 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:49] PROBLEM - puppet last run on cp2035 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:56] <_joe_> this alert is annoying [14:15:59] PROBLEM - puppet last run on cp6001 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:59] PROBLEM - puppet last run on cp6005 is CRITICAL: CRITICAL: 
Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:03] <_joe_> now we'll also get the recoveries [14:16:05] PROBLEM - puppet last run on cp3061 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:05] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:15] PROBLEM - puppet last run on cp2032 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:25] PROBLEM - puppet last run on cp2027 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:29] PROBLEM - puppet last run on cp3058 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:37] PROBLEM - puppet last run on cp6014 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:37] PROBLEM - puppet last run on cp3052 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:38] ofc :D [14:16:39] PROBLEM - puppet last run on cp5013 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:39] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:41] PROBLEM - puppet last run on cp2030 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:42] (the recoveries) [14:16:45] PROBLEM - puppet last run on cp6003 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:45] <_joe_> yeah [14:16:49] PROBLEM - puppet last run on cp2038 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:57] PROBLEM - puppet last run on cp1084 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:59] PROBLEM - puppet last run on cp3062 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:59] PROBLEM - puppet last run on cp6012 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:16:59] PROBLEM - puppet last run on cp6016 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:07] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:11] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:25] PROBLEM - puppet last run on cp3051 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:29] PROBLEM - puppet last run on cp2037 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:45] PROBLEM - puppet last run on cp3055 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:45] PROBLEM - puppet last run on cp6008 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:47] PROBLEM - puppet last run on cp5006 is CRITICAL: CRITICAL: 
Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:17:47] PROBLEM - puppet last run on cp5010 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:18:45] RECOVERY - puppet last run on cp4024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:18:45] RECOVERY - puppet last run on cp6007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:15] RECOVERY - puppet last run on cp5001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:21] RECOVERY - puppet last run on cp6015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:39] RECOVERY - puppet last run on cp4035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:46] (03PS2) 10Volans: sre.cdn.roll-restart-varnish: improve SAL log [cookbooks] - 10https://gerrit.wikimedia.org/r/775868 [14:19:47] RECOVERY - puppet last run on cp4032 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:52] (03CR) 10Volans: "addressed comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/775868 (owner: 10Volans) [14:19:57] RECOVERY - puppet last run on cp6013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:20:09] RECOVERY - puppet last run on cp3053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 
0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:20:13] RECOVERY - puppet last run on cp5016 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:20:29] RECOVERY - puppet last run on cp1082 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:22:12] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [14:22:13] !log (late) about 5 hours ago, I removed /var/run/php/use-config-schema from mw1415 to disable config schema loading (T304460) [14:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:22] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [14:22:30] RECOVERY - puppet last run on cp3062 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:22:55] (03CR) 10Volans: [C: 03+1] "LGTM, ship it!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [14:23:14] RECOVERY - puppet last run on cp2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:23:14] RECOVERY - puppet last run on cp4030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:23:31] RECOVERY - puppet last run on cp3063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:23:44] RECOVERY - puppet last run on cp2036 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:24:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Quick before prospector changes idea about our codebase!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [14:24:16] RECOVERY - puppet last run on cp4036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:24:28] RECOVERY - puppet last run on cp5005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:25:10] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Isaac) Chiming in as a heavy user of the stat boxes. It's difficult for me to follow this conversation so I'm mainly askin... 
[14:25:36] RECOVERY - puppet last run on cp5002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:25:40] RECOVERY - puppet last run on cp1083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:25:40] RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:25:44] RECOVERY - puppet last run on cp1088 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:10] RECOVERY - puppet last run on cp3052 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:11] RECOVERY - puppet last run on cp5013 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:11] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:26] RECOVERY - puppet last run on cp2038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:34] RECOVERY - puppet last run on cp1084 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:44] RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:46] RECOVERY - puppet last 
run on cp4021 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:46] RECOVERY - puppet last run on cp3051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:54] RECOVERY - puppet last run on cp6008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:27:56] RECOVERY - puppet last run on cp5010 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:28:00] RECOVERY - puppet last run on cp6011 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:28:12] RECOVERY - puppet last run on cp4034 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:28:12] RECOVERY - puppet last run on cp4022 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:28:30] RECOVERY - puppet last run on cp6010 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:28:58] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:30:58] RECOVERY - puppet last run on cp2034 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:31:10] PROBLEM - Host doh2002 is DOWN: PING CRITICAL - Packet 
loss = 100% [14:31:14] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:30] ^ expected, rebooting doh hosts [14:31:57] (03Merged) 10jenkins-bot: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [14:32:04] RECOVERY - puppet last run on cp2027 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:32:30] RECOVERY - puppet last run on cp2030 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:32:30] RECOVERY - Host doh2002 is UP: PING OK - Packet loss = 0%, RTA = 31.86 ms [14:32:34] RECOVERY - puppet last run on cp6003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:32:54] RECOVERY - puppet last run on cp6016 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:33:16] RECOVERY - puppet last run on cp3055 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:33:18] RECOVERY - puppet last run on cp2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:33:41] RECOVERY - puppet last run on cp6002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:33:51] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 
0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:02] RECOVERY - puppet last run on cp2032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:34:10] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:34:24] RECOVERY - puppet last run on cp5014 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:34:34] RECOVERY - puppet last run on cp3064 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:35:04] RECOVERY - puppet last run on cp6009 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:35:18] RECOVERY - puppet last run on cp5015 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:35:18] RECOVERY - puppet last run on cp5003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:35:32] RECOVERY - puppet last run on cp2039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:35:38] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:36:06] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5009.eqsin.wmnet with reason: host reimage [14:36:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:52] RECOVERY - puppet last run on cp6005 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:37:34] RECOVERY - puppet last run on cp3058 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:37:50] RECOVERY - puppet last run on cp6014 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:38:13] (03CR) 10JMeybohm: [C: 03+1] Improve secrets chart's rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/775870 (owner: 10Elukey) [14:38:24] RECOVERY - puppet last run on cp6012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:38:34] (03CR) 10Volans: [C: 03+2] cookbooks.sre: SREBatchRunnerBase improve sleeps [cookbooks] - 10https://gerrit.wikimedia.org/r/775867 (owner: 10Volans) [14:38:40] RECOVERY - puppet last run on cp2037 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:38:44] (03CR) 10Volans: [C: 03+2] sre.cdn.roll-restart-varnish: improve SAL log [cookbooks] - 10https://gerrit.wikimedia.org/r/775868 (owner: 10Volans) [14:38:50] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:38:54] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:39:35] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, last run 2 
minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:39:39] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5009.eqsin.wmnet with reason: host reimage [14:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:17] RECOVERY - puppet last run on cp2028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:40:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Create Spicerack cook book to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (10JMeybohm) [14:40:27] RECOVERY - puppet last run on cp6004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:40:31] RECOVERY - puppet last run on cp2031 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:40:33] RECOVERY - puppet last run on cp2042 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:40:37] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:40:43] RECOVERY - puppet last run on cp3060 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:40:54] 10SRE, 10Generated Data Platform, 10serviceops, 10Service-deployment-requests: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (10WDoranWMF) [14:41:42] 10SRE, 10Infrastructure-Foundations, 
10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10herron) Looks like ipmiseld isn't enabled on a sampling of these hosts, letting puppet ensure the service is enabled and running seems like a good nex... [14:41:57] PROBLEM - Host doh3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:13] RECOVERY - Host doh3001 is UP: PING OK - Packet loss = 0%, RTA = 81.39 ms [14:42:23] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:42:39] (03Merged) 10jenkins-bot: cookbooks.sre: SREBatchRunnerBase improve sleeps [cookbooks] - 10https://gerrit.wikimedia.org/r/775867 (owner: 10Volans) [14:42:55] (03PS1) 10Herron: ipmiseld: ensure service enabled and running [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) [14:42:59] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:43:03] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: improve SAL log [cookbooks] - 10https://gerrit.wikimedia.org/r/775868 (owner: 10Volans) [14:43:05] ^ expected, should resolve [14:43:17] PROBLEM - Host doh3002 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:27] (03CR) 10jerkins-bot: [V: 04-1] ipmiseld: ensure service enabled and running [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [14:43:29] (03CR) 10Elukey: [C: 03+2] Improve secrets chart's rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/775870 (owner: 10Elukey) [14:43:31] sukhe: this one too? 
^^^ [14:43:39] RECOVERY - Host doh3002 is UP: PING OK - Packet loss = 0%, RTA = 81.41 ms [14:43:44] I guess so :D [14:43:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on doh4001.wikimedia.org with reason: reboot for kernel update T304938 [14:44:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on doh4001.wikimedia.org with reason: reboot for kernel update T304938 [14:44:00] 10SRE, 10Generated Data Platform, 10serviceops, 10Service-deployment-requests: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10WDoranWMF) [14:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on doh4002.wikimedia.org with reason: reboot for kernel update T304938 [14:44:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doh4002.wikimedia.org with reason: reboot for kernel update T304938 [14:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:33] (03PS2) 10Herron: ipmiseld: ensure service enabled and running [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) [14:45:19] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10Kubernetes: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (10Volans) [14:45:59] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[14:46:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:46:17] PROBLEM - Check systemd state on doh3002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:20] (03PS1) 10Esanders: Remove unused Flow config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775876 [14:46:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:02] !log depool cp6016 for reimage - T290005 [14:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:07] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:47:27] RECOVERY - puppet last run on cp6006 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:47:52] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:55] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:47:57] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:49:05] RECOVERY - Check systemd state on doh3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:43] RECOVERY - puppet last run on cp3061 is OK: OK: Puppet is currently enabled, last run 27 minutes ago with 0 
failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:50:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:55] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:50:59] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:51:19] 10SRE-OnFire: 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10jcrespo) [14:51:23] 10SRE-swift-storage: Swift users and their usage - https://phabricator.wikimedia.org/T264291 (10Ottomata) We should add https://wikitech.wikimedia.org/wiki/Shared_Data_Platform as a potential usage of object storage. This would be a **very** large usage. [14:51:53] (03PS2) 10MMandere: site: Reimage cp6016 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775326 (https://phabricator.wikimedia.org/T290005) [14:51:55] RECOVERY - puppet last run on cp1086 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:52:02] RECOVERY - puppet last run on cp6001 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:52:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:34] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10jcrespo) [14:52:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on doh5001.wikimedia.org with reason: reboot for kernel update T304938 [14:52:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doh5001.wikimedia.org with reason: reboot for kernel update T304938 [14:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on doh5002.wikimedia.org with reason: reboot for kernel update T304938 [14:52:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doh5002.wikimedia.org with reason: reboot for kernel update T304938 [14:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:23] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:59] ^ again, related to reboot, these should clear up on their own [14:55:09] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6016 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775326 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:55:09] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 
'revscoring-editquality-goodfaith' for release 'main' . [14:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on doh6001.wikimedia.org with reason: reboot for kernel update T304938 [14:56:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doh6001.wikimedia.org with reason: reboot for kernel update T304938 [14:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on doh6002.wikimedia.org with reason: reboot for kernel update T304938 [14:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doh6002.wikimedia.org with reason: reboot for kernel update T304938 [14:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:29] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS buster [14:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:39] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster [14:58:32] (03CR) 10Jdlrobson: [C: 03+1] "This looks good. 
Should be safe to backport as soon as today, but won't come into effect on zh.wikipedia.org until at earliest a week toda" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775416 (https://phabricator.wikimedia.org/T273578) (owner: 10Func) [14:59:05] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) I suspect the partman definition isn’t speced correctly for them, or maybe the hardware is different in terms of disk layout. Not something I know... [15:00:41] 10SRE-OnFire (FY2021/2022-Q2), 10cloud-services-team (Kanban): 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10jcrespo) Adding @aborrero and @dcaro to see if they can bring the proposal to do a trial run with some cloud team members of the post-mortem/review process. Context: ht... [15:04:20] (03CR) 10Elukey: [C: 03+1] Don't add ingress port twice to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/775865 (owner: 10JMeybohm) [15:04:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10herron) >>! In T303398#7796432, @jbond wrote: > Change to stalled until TsepoThoabala return When is TsepoThoabala expected to return? [15:05:21] 10SRE, 10Infrastructure-Foundations, 10observability, 10Patch-For-Review, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10herron) p:05Triage→03Medium [15:05:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on durum[1001-1002].eqiad.wmnet with reason: reboot for update T304938 [15:05:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[15:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on durum[1001-1002].eqiad.wmnet with reason: reboot for update T304938 [15:05:43] 10SRE, 10Generated Data Platform, 10serviceops, 10Service-deployment-requests: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (10herron) p:05Triage→03Medium [15:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:01] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:06:02] 10SRE, 10Generated Data Platform, 10serviceops, 10Service-deployment-requests: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10herron) p:05Triage→03Medium [15:06:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:06:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[15:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:14] (03CR) 10Elukey: [C: 03+2] Update istio mesh settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/775855 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [15:07:26] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: fix condition [cookbooks] - 10https://gerrit.wikimedia.org/r/775879 [15:08:31] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:17] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) > malware accidentally downloaded (compromised library dependency, infected executable, etc) could easily "phone... [15:10:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 12 hosts with reason: reboot for update T304938 [15:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:18] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5009.eqsin.wmnet with OS buster [15:10:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 12 hosts with reason: reboot for update T304938 [15:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:27] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5009.eqsin.wmnet with OS buster com... 
[15:10:41] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:00] 10SRE-OnFire (FY2021/2022-Q2), 10cloud-services-team (Kanban): 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10lmata) 05Open→03In progress [15:13:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:15] !log pool cp5009 with HAProxy as TLS termination layer - T290005 [15:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:21] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:13:53] (03PS1) 10Volans: admin: add personal .tmux.conf to my home [puppet] - 10https://gerrit.wikimedia.org/r/775883 [15:15:05] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [15:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:10] (03CR) 10Volans: [C: 03+2] "trivial personal customization, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/775883 (owner: 10Volans) [15:16:56] (03CR) 10Volans: [C: 03+2] "Trivial fix, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/775879 (owner: 10Volans) [15:18:11] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [15:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:36] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - 
AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:57] 10SRE, 10ops-codfw, 10decommission-hardware: decommission kubernetes200[1-4] - https://phabricator.wikimedia.org/T303045 (10Papaul) a:03Papaul [15:20:36] (03PS1) 10Elukey: role::ml_k8s::worker: add calico/istio CNI settings to ml-serve-codfw [puppet] - 10https://gerrit.wikimedia.org/r/775884 (https://phabricator.wikimedia.org/T297612) [15:20:59] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: fix condition [cookbooks] - 10https://gerrit.wikimedia.org/r/775879 (owner: 10Volans) [15:21:33] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:37] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::worker: add calico/istio CNI settings to ml-serve-codfw [puppet] - 10https://gerrit.wikimedia.org/r/775884 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [15:22:11] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:49] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:28] ^ expected, durum updates [15:24:57] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:13] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10jcrespo) [15:25:47] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10jcrespo) T299964 scored, left open because it may have the ritual run on it. 
[15:27:03] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:27:33] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:28:21] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:28:47] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:29:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:27] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 105, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:19] 10SRE, 10ops-codfw: Dell switches testing: Setup mgmt for two servers for testing - https://phabricator.wikimedia.org/T305070 (10Papaul) [15:35:27] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:38] (03CR) 10Jelto: [V: 03+1] "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:36:03] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10BTullis) >>! In T300977#7821926, @Ottomata wrote: > > I appreciate the intention here, but I'm not sure if the combo of t... 
[15:38:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1133.eqiad.wmnet with reason: Maintenance [15:38:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1133.eqiad.wmnet with reason: Maintenance [15:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:18] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) > its primary goal is limiting the capability of any such malware to 'phone home' to a command & control endpoin... [15:41:57] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS buster [15:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:07] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster com... 
[15:43:58] (03CR) 10JMeybohm: [C: 03+2] Don't add ingress port twice to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/775865 (owner: 10JMeybohm) [15:44:22] !log pool cp6016 with HAProxy as TLS termination layer - T290005 [15:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:28] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:44:43] (03Abandoned) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:45:23] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:45:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:33] (03Merged) 10jenkins-bot: Don't add ingress port twice to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/775865 (owner: 10JMeybohm) [15:51:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:02] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (10jcrespo) Scored, up for a review @fgiunchedi in a week? 
Context: https://wikitech.wikimedia.org/wiki/Incident_review_ritual [15:53:18] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (10jcrespo) 05Open→03In progress [15:54:29] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10jcrespo) [15:55:21] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:59:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:09:57] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) Live on Diff: https://diff.wikimedia.org/2022/03/31/announcing-www-wikimedi... 
[16:11:10] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns2002.wikimedia.org [16:11:11] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dns2002.wikimedia.org [16:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:36] (03PS2) 10Zabe: Start writing to $wmgLocalServices the same value as to $wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774497 (https://phabricator.wikimedia.org/T45956) [16:11:51] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns2002.wikimedia.org [16:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:55] (03CR) 10JMeybohm: [C: 03+1] "From my POV this is now good to go! 🎉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:12:50] (03CR) 10Muehlenhoff: "Sounds good to me, but before we merge this, I'd like to have a deeper look _why_ it isn't running on some systems (could be outdated firm" [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [16:16:01] 10SRE, 10ops-codfw, 10decommission-hardware: decommission kubernetes200[1-4] - https://phabricator.wikimedia.org/T303045 (10Papaul) [16:16:25] 10SRE, 10ops-codfw, 10decommission-hardware: decommission kubernetes200[1-4] - https://phabricator.wikimedia.org/T303045 (10Papaul) 05Open→03Resolved complete [16:16:33] 10SRE, 10ops-codfw: Dell switches testing: Setup mgmt for two servers for testing - https://phabricator.wikimedia.org/T305070 (10Papaul) [16:17:08] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns2002.wikimedia.org [16:17:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: Maint', diff saved to 
https://phabricator.wikimedia.org/P24015 and previous config saved to /var/cache/conftool/dbconfig/20220331-161709-ladsgroup.json [16:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:19] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns1002.wikimedia.org [16:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:15] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns6001.wikimedia.org [16:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:20] hashar, jeena, _joe_: did I understand correctly that since the train happened earlier, it will not happen in two hours? would it be ok for me to enable the new config loading code on mw1415 (web), mw1448 (api), and mw1438 (jobrunner) for a couple of hours? [16:24:57] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:25:09] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 68 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:25:14] rzl: I see you are running the puppet window right now, do you have objections? 
[16:25:30] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns1002.wikimedia.org [16:25:31] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns3002.wikimedia.org [16:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:52] duesen: there's nothing going on in the puppet window - I don't know enough about the change to have an opinion beyond that :) [16:26:16] rzl: ok, thanks [16:27:09] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:27:25] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:45] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:58] ^ expected, sorry for the noise [16:28:37] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns6001.wikimedia.org [16:28:38] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns5001.wikimedia.org [16:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:51] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:29:51] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:30:36] dns reboot cause those, and I'm cycling them all (the BGP Status for Anycast alerts) [16:30:45] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:30:59] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:30:59] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:31:21] duesen: that's right, you wouldn't be interrupting the train [16:31:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 65 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:31:35] jeena: thank you, I will go ahead then! [16:31:35] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 81.11 ms [16:32:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: Maint', diff saved to https://phabricator.wikimedia.org/P24016 and previous config saved to /var/cache/conftool/dbconfig/20220331-163213-ladsgroup.json [16:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:59] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:13] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:33:13] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:33:27] !log creating /var/run/php/use-config-schema on canaries mw1415, mw1438, and mw1448 to enable config schema loading (T304460) [16:33:28] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3002.wikimedia.org [16:33:29] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
dns4002.wikimedia.org [16:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:33] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [16:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:43] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:33:57] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:35:31] (BlazegraphJvmQuakeWarnGC) resolved: (6) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [16:35:59] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:36:13] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:37:20] (03PS1) 10Ladsgroup: auto_schema: Wrap starting replication with finally [software] - 10https://gerrit.wikimedia.org/r/775895 [16:37:20] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5001.wikimedia.org [16:37:21] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns4001.wikimedia.org [16:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:51] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:38:03] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:38:09] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:40:03] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:40:17] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:40:18] (03CR) 10jerkins-bot: [V: 04-1] auto_schema: Wrap starting replication with finally [software] - 10https://gerrit.wikimedia.org/r/775895 (owner: 10Ladsgroup) [16:40:22] (JobUnavailable) firing: (6) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:41:52] 10SRE, 10Infrastructure-Foundations: Advertised RSS/Atom feeds for wikimediastatus.net don't work - https://phabricator.wikimedia.org/T305174 (10Legoktm) [16:42:37] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4002.wikimedia.org [16:42:38] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns5002.wikimedia.org [16:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:19] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:21] PROBLEM - BGP 
status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 70 probes of 673 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:44:31] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:44:45] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:44:47] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:44:53] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 105, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:16] (03CR) 10Ryan Kemper: [C: 03+2] team-search-platform: alert when flink task slots is too low [alerts] - 10https://gerrit.wikimedia.org/r/775869 (https://phabricator.wikimedia.org/T305068) (owner: 10DCausse) [16:45:22] (JobUnavailable) firing: (6) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:47:10] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4001.wikimedia.org [16:47:11] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns3001.wikimedia.org [16:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: Maint', diff 
saved to https://phabricator.wikimedia.org/P24017 and previous config saved to /var/cache/conftool/dbconfig/20220331-164717-ladsgroup.json [16:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:27] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:47:41] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:47:59] (03Merged) 10jenkins-bot: team-search-platform: alert when flink task slots is too low [alerts] - 10https://gerrit.wikimedia.org/r/775869 (https://phabricator.wikimedia.org/T305068) (owner: 10DCausse) [16:49:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 65 probes of 673 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:49:43] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:50:23] (03PS2) 10Ladsgroup: auto_schema: Wrap starting replication with finally [software] - 10https://gerrit.wikimedia.org/r/775895 [16:51:22] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5002.wikimedia.org [16:51:23] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns6002.wikimedia.org [16:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:34] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [16:51:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:59] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:54:55] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3001.wikimedia.org [16:54:56] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns2001.wikimedia.org [16:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:05] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:03] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:57:44] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns6002.wikimedia.org [16:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:07] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:58:09] (03PS2) 10Ryan Kemper: wdqs: tune jvmquake settings [puppet] - 10https://gerrit.wikimedia.org/r/775254 (owner: 10DCausse) [16:58:44] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [16:58:45] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2010.codfw.wmnet [16:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:33] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:59:59] PROBLEM - 
BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:00:17] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: Maint', diff saved to https://phabricator.wikimedia.org/P24018 and previous config saved to /var/cache/conftool/dbconfig/20220331-170221-ladsgroup.json [17:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:53] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 87 connections established with conf2004.codfw.wmnet:4001 (min=87) https://wikitech.wikimedia.org/wiki/PyBal [17:10:19] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2010.codfw.wmnet [17:10:21] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3007.esams.wmnet [17:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:03] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:15:09] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns2001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:15:34] (03PS1) 10Cwhite: logstash: provide events with bucket information 
[puppet] - 10https://gerrit.wikimedia.org/r/775899 (https://phabricator.wikimedia.org/T205013) [17:16:11] PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:54] (03PS2) 10Cwhite: logstash: provide events with bucket information [puppet] - 10https://gerrit.wikimedia.org/r/775899 (https://phabricator.wikimedia.org/T205013) [17:16:57] PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:17:13] 10SRE, 10Infrastructure-Foundations, 10observability, 10Patch-For-Review, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10herron) Looking more closely I see all bullseye hosts have the unit enabled, while all buster hosts do not. ` cumin1001:~$ sudo... [17:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: Maint', diff saved to https://phabricator.wikimedia.org/P24019 and previous config saved to /var/cache/conftool/dbconfig/20220331-171724-ladsgroup.json [17:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:42] Mar 31 16:57:36 dns2001 anycast-healthchecker[1129]: Invalid configuration: /var/run/anycast-healthchecker doesn't exit [17:17:47] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3007.esams.wmnet [17:17:47] this is ... 
weird [17:17:48] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4007.ulsfo.wmnet [17:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:15] the directory indeed doesn't exist but it should [17:20:39] RECOVERY - Check systemd state on dns2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:45] I fixed it by running mkdir manually because well, the directory should have been created [17:20:48] sukhe: yeah I'm looking at it, the root of it is an extra definition in /en/i [17:20:52] maybe it's a one-off [17:20:52] oh [17:21:01] I'm rebooting all of those [17:21:05] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:21:12] I don't know if the mkdir has anything to do with it [17:21:18] (03PS3) 10Herron: ipmiseld: ensure service enabled and running [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) [17:21:27] RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:21:31] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:21:41] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:21:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:21:52] (03CR) 10jerkins-bot: [V: 04-1] ipmiseld: ensure service enabled and running [puppet] - 10https://gerrit.wikimedia.org/r/775875 
(https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [17:21:54] it's mostly that the puppet agent couldn't run, because it couldn't reach the private vlan in the same row, because of a bad interface config [17:21:55] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns2001 is OK: OK: UP (pid=16608) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:22:11] what I don't know yet is why we ended up with a bad interface config [17:22:21] (03PS4) 10Herron: ipmiseld: ensure service enabled and running [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) [17:23:14] bblack: but what I wonder is what caused the directory to not exist, after a reboot? because it should have been created (and is on other hosts) when anycast-hc is installed [17:23:55] (03CR) 10Herron: ipmiseld: ensure service enabled and running (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [17:24:39] sukhe: puppet creates it [17:24:42] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4007.ulsfo.wmnet [17:24:43] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5003.eqsin.wmnet [17:24:43] puppet couldn't run [17:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:03] Mar 31 17:20:23 dns2001 puppet-agent[15254]: (/Stage[main]/Bird::Anycast_healthchecker/File[/var/run/anycast-healthchecker/]/ensure) created (corre [17:25:06] ctive) [17:25:32] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns2001.wikimedia.org [17:25:33] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns1001.wikimedia.org [17:25:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:40] maybe the missing context here is that /var/run is a link to /run, and /run is a memory fs that autowipes on every reboot [17:26:20] normally the agent recreates it on the first agent run, but the first agent run never happened because it couldn't talk to the puppetmaster [17:26:28] interesting [17:26:37] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:26:37] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:26:42] ok, then I guess my step was preemptive because I was approaching it from a different angle :P [17:26:43] (03PS3) 10Ryan Kemper: wdqs: tune jvmquake settings [puppet] - 10https://gerrit.wikimedia.org/r/775254 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [17:27:46] well, the first thing I noticed was not so much the alerts here, but that the reboot script was having a hard time getting a puppet last-run timestamp [17:27:58] so I went to the host and tried to run puppet for myself, and it just hung forever [17:28:22] then I tried pinging the local puppetmaster and that failed (over ipv6) [17:28:36] then I looked at "ip -6 route" and saw an errant route for a network it's not actually attached to [17:28:40] this is dns2001 right? 
[17:28:43] yes [17:29:10] (03PS4) 10Ryan Kemper: wdqs: tune jvmquake settings [puppet] - 10https://gerrit.wikimedia.org/r/775254 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [17:29:49] ip -6 addr showed the reason for the route, there was an extra IPv6 on the main interface, in that private network [17:30:05] (which shouldn't be there, its only interface is attached to the public vlan) [17:30:19] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5003.eqsin.wmnet [17:30:20] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6003.drmrs.wmnet [17:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:39] then I looked at /etc/network/interfaces to see if that IP came from config, and it did. There was a line in the main interface stanza like: [17:30:42] up ip addr add 2620:0:860:103:208:80:153:77/64 dev ens2f0np0 [17:31:02] (in addition to the correct one that's in :3: (public for that row) rather than :103: (private)) [17:31:08] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34650/console" [puppet] - 10https://gerrit.wikimedia.org/r/775254 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [17:31:21] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns1001.wikimedia.org [17:31:24] so I deleted that line and manually deleted the IP from the interface itself, and then the agent ran fine [17:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:00] there's no record of why/when that bad line was added to interfaces file, I guess sometime since it was last rebooted, or we would've already had this problem [17:32:07] but it's too old to show up in syslog [17:32:09] my first instinct was if the 
bird6 is responsible for this or related to the durum6001 issue we had, but there is no IPv6 anycast on the recursor boxes anyway [17:32:09] (03PS5) 10Ryan Kemper: wdqs: tune jvmquake settings [puppet] - 10https://gerrit.wikimedia.org/r/775254 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [17:32:54] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: tune jvmquake settings [puppet] - 10https://gerrit.wikimedia.org/r/775254 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [17:33:11] hmmm maybe logstash [17:37:38] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6003.drmrs.wmnet [17:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:03] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:39:42] 10SRE, 10Infrastructure-Foundations: Advertised RSS/Atom feeds for wikimediastatus.net don't work - https://phabricator.wikimedia.org/T305174 (10herron) p:05Triage→03Medium Hey @Legoktm thanks for the report, yes looks like these were indeed set to inactive. That's been enabled and should be working now. 
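A minimal sketch of the check walked through in the dns2001 discussion above (17:30:39 to 17:31:24): scan the text of an `/etc/network/interfaces` file for `up ip addr add ...` lines and flag any address that falls outside the networks the host is expected to be attached to. The function name `unexpected_addresses`, the regular expression, and the allowed prefix `2620:0:860:3::/64` are assumptions made for illustration (the prefix is inferred from the ":3:" remark in the chat). The first address in `SAMPLE` is hypothetical; only the `:103:` line is quoted from the log. This is not a tool that was used in the channel.

```python
import ipaddress
import re

# Matches lines such as:
#   up ip addr add 2620:0:860:103:208:80:153:77/64 dev ens2f0np0
# The pattern is an assumption; real interfaces files may use other forms.
ADDR_ADD = re.compile(
    r"^\s*(?:up|post-up)\s+ip\s+(?:-6\s+)?addr\s+add\s+(\S+)\s+dev\s+(\S+)"
)


def unexpected_addresses(interfaces_text, allowed_networks):
    """Return (line_number, address, device) for every address added via
    'up ip addr add' that is not inside any of allowed_networks.

    Raises ValueError if an address or network string cannot be parsed.
    """
    allowed = [ipaddress.ip_network(net) for net in allowed_networks]
    findings = []
    for lineno, line in enumerate(interfaces_text.splitlines(), start=1):
        match = ADDR_ADD.match(line)
        if not match:
            continue
        address, device = match.group(1), match.group(2)
        iface = ipaddress.ip_interface(address)
        inside = any(
            iface.version == net.version and iface.ip in net for net in allowed
        )
        if not inside:
            findings.append((lineno, address, device))
    return findings


# Hypothetical stanza: the first address is made up for illustration,
# the second is the errant line quoted in the log.
SAMPLE = """\
iface ens2f0np0 inet6 static
    up ip addr add 2620:0:860:3:208:80:153:77/64 dev ens2f0np0
    up ip addr add 2620:0:860:103:208:80:153:77/64 dev ens2f0np0
"""

if __name__ == "__main__":
    for lineno, address, device in unexpected_addresses(
        SAMPLE, ["2620:0:860:3::/64"]
    ):
        print(f"line {lineno}: {address} on {device} is outside the expected networks")
```

Comparing against an explicit allow-list of networks mirrors the reasoning in the chat: the host's only interface sits on the public vlan, so any address in the private range is suspect no matter how it got into the file.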
[17:41:23] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host authdns1001.wikimedia.org [17:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:02] (03PS1) 10Volans: [WIP] service: add new module to map service::catalog [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 [17:45:13] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:45:52] (03CR) 10CDanis: [C: 04-1] "Please also fix patch description typos" [puppet] - 10https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [17:47:25] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:47:41] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host authdns1001.wikimedia.org [17:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:30] (03CR) 10CDanis: [C: 03+1] "Please fix typo in patch description but otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/773454 (owner: 10Giuseppe Lavagetto) [17:51:44] (03CR) 10CDanis: [C: 03+1] varnish::frontend: remove normalization for parameter [puppet] - 10https://gerrit.wikimedia.org/r/773455 (owner: 10Giuseppe Lavagetto) [17:51:58] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host authdns2001.wikimedia.org [17:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:59] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:57:25] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:57:33] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:57:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:57:39] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet [17:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:43] ^ more DNS reboots, should be the last of them [17:58:20] (03PS1) 10Ladsgroup: Set noflip for css rule that needs it [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775443 (https://phabricator.wikimedia.org/T305156) [17:58:38] jouncebot: nowandnext [17:58:38] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [17:58:38] In 0 hour(s) and 1 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T1800) [17:58:39] there will be some BFD/BGP for the lvs reboot as well [17:59:14] the train has deployed (EU time) so I use the time to deploy stuff [17:59:23] (03CR) 10Ladsgroup: [C: 03+2] Set noflip for css rule that needs it [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775443 (https://phabricator.wikimedia.org/T305156) (owner: 10Ladsgroup) [18:00:05] hashar and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T1800). 
[18:02:23] the train is done jeena :) unless something exploded over the last few hours. [18:02:53] thanks hashar! [18:03:34] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet [18:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:08:53] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2009.codfw.wmnet [18:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:33] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:10:50] Need to perform a rolling restart of wdqs public, so going to roll a quick deploy and kill two birds with one stone [18:10:59] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:11:25] Hopefully it's not too distracting with the lvs reboots, but if necessary I can backfill the log messages later if we're worried about channel noise [18:11:31] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.109`. 
Pre-deploy tests passing on canary `wdqs1003` [18:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:46] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@ba88f51]: 0.3.109 [18:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:11] !log [WDQS Deploy] Tests passing following deploy of `0.3.109` on canary `wdqs1003`; proceeding to rest of fleet [18:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:58] ryankemper: are you worried about the lvs noise clouding your deploy noise or the other way around? :) [18:14:35] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host authdns2001.wikimedia.org [18:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:37] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:15:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:16:11] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 67 connections established with conf2004.codfw.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [18:16:26] (03Merged) 10jenkins-bot: Set noflip for css rule that needs it [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775443 (https://phabricator.wikimedia.org/T305156) (owner: 10Ladsgroup) [18:19:10] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@ba88f51]: 0.3.109 (duration: 07m 24s) [18:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:41] bblack: I sense sarcasm at play here but in the off chance you were serious, the latter [18:19:49] my apologies tho, that should be it for log noise [18:20:25] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host lvs2009.codfw.wmnet [18:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:34] nah we expect this channel to be noisy :) [18:21:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:51] <3 [18:22:27] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.72:6443]) https://wikitech.wikimedia.org/wiki/PyBal [18:22:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:22:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:07] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.5/extensions/TimedMediaHandler/resources/ext.tmh.player.styles.less: Backport: [[gerrit:775443|Set noflip for css rule that needs it (T305156)]] (duration: 00m 51s) [18:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:14] T305156: Media player broken on Early adopters RTL wikis - https://phabricator.wikimedia.org/T305156 [18:23:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:35] jouncebot: now [18:32:35] For the next 1 hour(s) and 27 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T1800) [18:34:35] want to add some noise here by rebooting gerrit-replica [18:36:34] !log gerrit-replica.wikimedia.org short downtime, rebooting gerrit2001 [18:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:36:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3054.esams.wmnet [18:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:23] this will cause some alerts about CI cloning from there.. 1 min :) [18:39:41] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:40:37] host is back. cloning is broken. sigh [18:40:58] ah, working again :) [18:41:22] didn't even get the expected alerts. ok then [18:41:48] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3005.esams.wmnet [18:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:23] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:43:14] !log removing /var/run/php/use-config-schema from canaries mw1415, mw1438, and mw1448 to disable config schema loading (T304460) [18:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:20] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460 [18:46:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:47:37] Amir1: are you going to need mwdebug* in a moment? [18:47:53] mutante: nope, I'm done with deployment [18:47:58] ACK, thanks [18:48:03] rebooting those [18:48:48] apergos: I see you are using mwdebug1002 like right now. Would you mind a reboot?
[18:49:13] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3005.esams.wmnet [18:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:16] !log mwdebug1001 - rebooting [18:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3054.esams.wmnet [18:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:19] I wouldn't [18:52:38] mutante: feel free [18:52:48] apergos: ACK, thanks [18:53:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5013.eqsin.wmnet [18:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:38] apergos: and..already back. those are fast [18:53:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3050.esams.wmnet [18:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:09] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4006.ulsfo.wmnet [18:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:54] !log scandium - rebooting [18:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:37] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), 10User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (10colewhite) [18:59:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4022.ulsfo.wmnet [18:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:22] !log https://parsoid-rt-tests.wikimedia.org/ - short downtime due to maintenance [18:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:35] !log 
bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4006.ulsfo.wmnet [18:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:08] !log testreduce1001 - rebooting [19:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:27] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:00:27] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:00:38] !log rzl@apt1001:~$ sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/bullseye/httpbb_0.0.1-1+deb11u1_source.changes [19:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:04] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5002.eqsin.wmnet [19:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:32] !log testreduce1001 - needed manual nginx restart after reboot to make https://parsoid-rt-tests.wikimedia.org/ work again [19:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:33] !log doc.wikimedia.org - short downtime due to maintenance, rebooting doc1001 [19:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:37] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5002.eqsin.wmnet [19:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3050.esams.wmnet [19:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:27] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cp5013.eqsin.wmnet [19:07:29] !log bblack@cumin1001 conftool action : set/weight=1; selector: cluster=ml_staging [19:07:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1075.eqiad.wmnet [19:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:40] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: cluster=ml_staging [19:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2035.codfw.wmnet [19:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:05] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6002.drmrs.wmnet [19:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:06] 10SRE, 10Wikimedia-Mailing-lists: Figure out if we can remove legacy domain support for mailing lists - https://phabricator.wikimedia.org/T280472 (10jhathaway) I did some quick analysis and it appears like the vast majority of traffic to the old addresses is spam. As an example, for the messages sent on March... 
[19:10:09] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:10:12] 10SRE, 10Wikimedia-Mailing-lists: Figure out if we can remove legacy domain support for mailing lists - https://phabricator.wikimedia.org/T280472 (10jhathaway) a:03jhathaway [19:10:19] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:11:53] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1013.eqiad.wmnet [19:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4022.ulsfo.wmnet [19:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5015.eqsin.wmnet [19:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:13] !log phab2001 - git-ssh.codfw - rebooting - might cause pybal alert [19:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3051.esams.wmnet [19:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:52] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6002.drmrs.wmnet [19:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:54] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1013.eqiad.wmnet [19:15:55] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1014.eqiad.wmnet [19:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:43] PROBLEM - Host phab2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:48] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [19:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:49] ACKNOWLEDGEMENT - Host phab2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn investigating, was already half broken [19:19:32] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet [19:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:49] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1014.eqiad.wmnet [19:20:50] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1015.eqiad.wmnet [19:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:11] !log phab2001 - powercycling via mgmt [19:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:17] !log remove openjdk-8-jre from eqiad logstash nodes T301770 [19:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:23] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:23] T301770: Remove obsolete Java 8 packages from logstash cluster - https://phabricator.wikimedia.org/T301770 [19:21:33] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:22:15] 
ACKNOWLEDGEMENT - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn we only have 1 server per DC :/ https://wikitech.wikimedia.org/wiki/PyBal [19:22:15] ACKNOWLEDGEMENT - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn we only have 1 server per DC :/ https://wikitech.wikimedia.org/wiki/PyBal [19:22:25] PROBLEM - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [19:22:58] those diff checks are not me, I don't think [19:23:05] git-ssh, etc [19:23:09] RECOVERY - Host phab2001 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [19:23:23] bblack: yea, they are me. but that host just came back after I powercycled it [19:23:24] seems like mutante is working on those? 
it says daniel_zahn [19:23:27] :) [19:23:31] ack [19:23:33] give it a moment [19:23:55] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet [19:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:14] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1015.eqiad.wmnet [19:24:15] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet [19:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:49] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [19:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:21] that machine already had a hardware RAM issue, glad it came back at all [19:26:41] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2008.codfw.wmnet [19:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5015.eqsin.wmnet [19:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:04] (03PS1) 10Volans: sre.hosts.reboot-single: fail on Icinga unoptimal [cookbooks] - 10https://gerrit.wikimedia.org/r/775946 [19:28:06] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet [19:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3051.esams.wmnet [19:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:32] well, the host is back and the service seems back ..all I am missing is the 
recovery [19:29:57] (03CR) 10BBlack: [C: 03+1] "Sounds great to me, thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/775946 (owner: 10Volans) [19:31:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3052.esams.wmnet [19:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:57] RECOVERY - PyBal IPVS diff check on lvs2008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:32:43] bblack: lvs2008 also had "10.192.17.7 Pybal connections to etcd" alert .. that I don't think I was related to [19:32:55] glad to see that recovery now [19:33:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5014.eqsin.wmnet [19:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:17] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [19:37:26] mutante: lvs2008 was me, eventually, we're now crossing over causing alerts on some of the same hosts from different things :) [19:37:43] but the lvs2010 one above is still real [19:37:45] yep:) ACK [19:37:52] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:37:57] well, the exact same thing recovered on 2008 [19:38:06] hmmm ok [19:38:07] 👋 [19:38:11] but before that happened I was wondering as well why not [19:38:27] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2008.codfw.wmnet [19:38:29] not sure about zotero being related [19:38:31] <_joe_> rzl: you do the rolling restart? 
[19:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:32] assume not, for now [19:38:37] ^ is that page expected from LVS work, or should I look for legit zotero causes-- okay thanks [19:38:47] _joe_: can do, checking first [19:38:49] there is ongoing work on LVS but I don't think it's related to zotero [19:38:58] so far [19:39:09] cpu doesn't look crazy high? but it was recently [19:39:20] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:40:21] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:55] <_joe_> rzl: it is heavily throttled AFAICS [19:40:55] bblack: I am starting to think the git-ssh alerts may need pybal restart. that service was down briefly but is back now but the alerts dont go away and I think it happened before and then pybal restart fixed it.. [19:41:11] _joe_: ack okay [19:41:12] mutante: it shouldn't, but I'm not saying it's not possible [19:41:27] _joe_: just codfw, or eqiad too? [19:41:36] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Is it possible to put more RAM in cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet? - https://phabricator.wikimedia.org/T303840 (10Andrew) 05Open→03Invalid I believe that this issue is no longer relevant now that T302855 is resolv... [19:41:54] mutante: I assume you're talking about the ACK'd CRITs on lvs2010 and lvs2008, right? 
[19:41:57] PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled [19:42:02] oh ugh that's my problem I was just looking at eqiad [19:42:07] okay, rolling the restart on codfw now [19:42:27] <_joe_> yeah there's like half the containers with high cpu usage [19:42:28] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: sync [19:42:30] bblack: yes, those are the ones. caused by phab2001 reboot. but I can ssh to that [19:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:39] Mar 31 19:38:24 lvs2008 pybal[2915]: [git-ssh6_22 IdleConnection] WARN: phab2001-vcs.codfw.wmnet (enabled/down/pooled): Connection to 2620:0:860:103:10:192:32:149:22 failed. [19:42:41] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: sync [19:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:50] bblack: ooh.. IPv6 only maybe then [19:42:51] maybe just ipv6 is borked? [19:42:55] yea [19:43:07] this happened before.. 
looking [19:43:20] !log Rolling-restarted zotero to un-wedge wedged pods with offscale high CPU [19:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3052.esams.wmnet [19:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:48] yep, looking better now [19:44:55] awesome [19:45:19] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:45:41] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3006.esams.wmnet [19:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:51] !log phab2001 - systemctl restart ssh-phab [19:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:02] bblack: ssh 2620:0:860:103:10:192:32:149 works now. race condition.. during start up at some point the secondary IPv6 gets added to the interface.. but the "ssh-phab" service is already started and then doesnt listen on IPv6.. and that box has 2 (4) IPs. hope that's it now [19:49:53] ack, thanks! [19:51:53] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [19:53:19] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:54:11] ok, i just need to have patience and realize you cant speed those up by just telling Icinga to reschedule asap [19:55:36] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3006.esams.wmnet [19:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:45] 10SRE, 10ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (10Cmjohnson) Disk has been ordered You have successfully submitted request SR1089086900. 
[19:57:16] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2251.codfw.wmnet [19:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:35] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2252.codfw.wmnet [19:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:44] yeah, re-sched doesn't seem to do much for me today, it may just be so backlogged that it's kinda pointless [19:59:01] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2271.codfw.wmnet [19:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:10] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2272.codfw.wmnet [19:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220331T2000). [20:00:05] Sergi0 and zabe: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:07] 10SRE-Access-Requests: Access for new Data Platform Dev: Thomas Chin - https://phabricator.wikimedia.org/T305193 (10tchin) [20:00:16] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4005.ulsfo.wmnet [20:00:18] (03CR) 10Volans: [C: 03+2] sre.hosts.reboot-single: fail on Icinga unoptimal [cookbooks] - 10https://gerrit.wikimedia.org/r/775946 (owner: 10Volans) [20:00:19] o/ [20:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:30] o/ [20:00:32] * thcipriani waves [20:00:34] o/ [20:01:00] (03CR) 10Thcipriani: [C: 03+2] Newcomer tasks: always align button and text to the right [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775371 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:01:20] !log mw2251,mw2252 - canary appserver, rebooting [20:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:45] (03PS1) 10Andrew Bogott: OpenStack nova: Fix the regex hack that validates new VM names [puppet] - 10https://gerrit.wikimedia.org/r/775949 (https://phabricator.wikimedia.org/T304694) [20:03:46] thcipriani: if you could give me a minute to finish a reboot.. it prevents scap warnings [20:03:51] (03PS4) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) [20:03:58] mutante: sure thing, let me know when all's clear [20:04:07] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4005.ulsfo.wmnet [20:04:11] thanks! 
on it [20:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2035.codfw.wmnet [20:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:29] (03PS5) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) [20:04:44] !log mw2271,mw2222 - canary appserver, rebooting [20:04:47] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: fail on Icinga unoptimal [cookbooks] - 10https://gerrit.wikimedia.org/r/775946 (owner: 10Volans) [20:04:47] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:29] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2374.codfw.wmnet [20:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:38] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2376.codfw.wmnet [20:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:03] (03CR) 10Thcipriani: [C: 03+2] Migrate $wmfServiceConfig to $wmgServiceConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774019 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:06:14] sukhe, bblack: the change for the reboot-single is now deployed, lmk if there is any issue (will exit with FAIL if icinga not optimal) [20:06:25] volans: <3 [20:06:27] PROBLEM - Host mw2252 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:39] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[20:06:45] PROBLEM - Host mw2271 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5014.eqsin.wmnet [20:06:47] RECOVERY - Host mw2252 is UP: PING OK - Packet loss = 0%, RTA = 32.79 ms [20:06:47] volans: thanks! [20:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:53] (03Merged) 10jenkins-bot: Migrate $wmfServiceConfig to $wmgServiceConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774019 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:06:59] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5001.eqsin.wmnet [20:07:02] (03PS6) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) [20:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:04] (03PS1) 10Kosta Harlan: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T303240) [20:07:05] ACKNOWLEDGEMENT - Host mw2271 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn maintenance reboot [20:07:30] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet [20:07:30] 10SRE-Access-Requests: Access for new Data Platform Dev: Thomas Chin - https://phabricator.wikimedia.org/T305193 (10WDoranWMF) As @tchin 's manager I approve. 
[20:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:47] RECOVERY - Host mw2271 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [20:08:29] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2251.codfw.wmnet [20:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:44] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2271.codfw.wmnet [20:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:08] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2252.codfw.wmnet [20:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:15] PROBLEM - Host mw2374 is DOWN: PING CRITICAL - Packet loss = 100% [20:09:19] PROBLEM - Host mw2376 is DOWN: PING CRITICAL - Packet loss = 100% [20:09:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:09:29] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2272.codfw.wmnet [20:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:43] (03CR) 10Kosta Harlan: [C: 04-2] "Not ready to start the campaign yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [20:10:15] RECOVERY - Host mw2374 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [20:10:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:10:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:37] !log dzahn@cumin2002 conftool 
action : set/pooled=yes; selector: dc=codfw,name=mw2374.codfw.wmnet [20:10:37] RECOVERY - Host mw2376 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [20:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:45] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:05] thcipriani: all yours! [20:11:17] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2376.codfw.wmnet [20:11:18] mutante: thanks! [20:11:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1075.eqiad.wmnet [20:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:02] zabe: your first one is on mwdebug1002 if there's anything to test there [20:12:13] (it doesn't seem to explode afaict :)) [20:12:26] also, all mwdebug* hosts have been rebooted recently, fyi [20:12:35] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5001.eqsin.wmnet [20:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:49] nice fresh mwdebugs :D [20:13:07] thcipriani, lgtm (I also don't test further then checking if anything explodes and if logstash is clear) [20:13:15] (03PS3) 10Thcipriani: Start writing to $wmgLocalServices the same value as to $wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774497 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:13:30] (03CR) 10Thcipriani: [C: 03+2] Start writing to $wmgLocalServices the same value as to 
$wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774497 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:14:13] (03Merged) 10jenkins-bot: Start writing to $wmgLocalServices the same value as to $wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774497 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:14:41] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet [20:14:43] zabe: that's what I figured, but I wanted to check with you in case there was something deeper you wanted to check :) [20:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:07] !log thcipriani@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: Config: [[gerrit:774019|Migrate $wmfServiceConfig to $wmgServiceConfig (T45956)]] (duration: 00m 50s) [20:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:13] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:16:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2035.codfw.wmnet,service=ats-be [20:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:57] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2007.codfw.wmnet [20:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:02] !log bblack@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6001.drmrs.wmnet [20:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:39] zabe: could you rebase this one for me? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/774499 gerrit's UI is being...difficult [20:18:55] sure [20:21:10] (03PS3) 10Zabe: Migrate $wmfLocalServices to $wmgLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774499 (https://phabricator.wikimedia.org/T45956) [20:21:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:21:51] !log contint2002 - reboot (insetup host) [20:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:58] thcipriani, done [20:22:07] thanks! Second one going live now [20:22:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:22:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:38] !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:774497|Start writing to $wmgLocalServices the same value as to $wmfLocalServices (T45956)]] (duration: 00m 50s) [20:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:45] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:22:51] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6001.drmrs.wmnet [20:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:57] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[20:23:09] 10SRE-Access-Requests: Access for new Data Platform Dev: Thomas Chin - https://phabricator.wikimedia.org/T305193 (10herron) [20:23:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:30] (03CR) 10Thcipriani: [C: 03+2] Migrate $wmfLocalServices to $wmgLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774499 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:14] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2007.codfw.wmnet [20:24:15] (03Merged) 10jenkins-bot: Migrate $wmfLocalServices to $wmgLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774499 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:38] (03PS2) 10Zabe: Stop writing to $wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774500 (https://phabricator.wikimedia.org/T45956) [20:26:24] (03PS1) 10Andrew Bogott: nova api: Use a diff rather than a replacement file for regex hack [puppet] - 10https://gerrit.wikimedia.org/r/775953 (https://phabricator.wikimedia.org/T207538) [20:26:52] (03CR) 10jerkins-bot: [V: 04-1] nova api: Use a diff rather than a replacement file for regex hack [puppet] - 10https://gerrit.wikimedia.org/r/775953 (https://phabricator.wikimedia.org/T207538) (owner: 10Andrew Bogott) [20:26:59] (03CR) 10jerkins-bot: [V: 04-1] Newcomer tasks: always align button and text to the right [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775371 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:27:17] (03CR) 10Thcipriani: Newcomer tasks: always align button and text to the right [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775371 
(https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:27:21] (03CR) 10Thcipriani: [C: 03+2] Newcomer tasks: always align button and text to the right [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775371 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:28:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:46] (03PS2) 10Andrew Bogott: nova api: Use a diff rather than a replacement file for regex hack [puppet] - 10https://gerrit.wikimedia.org/r/775953 (https://phabricator.wikimedia.org/T207538) [20:28:48] zabe: third one going now [20:29:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:29:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:42] !log thcipriani@deploy1002 Synchronized wmf-config: Config: [[gerrit:774499|Migrate $wmfLocalServices to $wmgLocalServices (T45956)]] (duration: 00m 51s) [20:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:45] (03PS1) 10Herron: admin: add tchin to groups platform-engineering and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/775954 (https://phabricator.wikimedia.org/T305193) [20:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:50] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:29:52] sergi0: I'm retesting yours, CI is reporting something that looks unrelated (AFAICT) [20:30:26] yes, just saw the failure. 
ty [20:30:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:55] (03CR) 10Thcipriani: [C: 03+2] Stop writing to $wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774500 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:30:58] brennen can I add some phpcs cleanup patches to the current window? [20:31:07] or thcipriani since it looks like you're deploying [20:31:22] which patches? :) [20:31:26] 10SRE-OnFire, 10Data-Persistence (Consultation), 10MediaWiki-extensions-CentralAuth, 10Platform Engineering, and 5 others: Slow query bringing down s7 - https://phabricator.wikimedia.org/T305119 (10RLazarus) Draft incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-31_api_er... [20:31:28] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/775005/3 and the 2 follow ups [20:31:50] no production impact - phpcs.xml updates, rename test files, and add some tiny documentation [20:31:50] (03PS3) 10Andrew Bogott: nova api: Use a diff rather than a replacement file for regex hack [puppet] - 10https://gerrit.wikimedia.org/r/775953 (https://phabricator.wikimedia.org/T207538) [20:32:01] DannyS712: sure! [20:32:30] (03Merged) 10jenkins-bot: Stop writing to $wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774500 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:32:36] thanks - anything needed from me for testing? 
I assume not, probably doesn't even need to be staged on a debug host first [20:33:02] DannyS712: if you can add them to the Deployments page, that'd be helpful :) [20:33:07] doing [20:33:11] <3 [20:33:49] 10SRE-Access-Requests, 10Patch-For-Review: Access for new Data Platform Dev: Thomas Chin - https://phabricator.wikimedia.org/T305193 (10herron) Looping in @Ottomata and @odimitrijevic for analytics groupadd approval [20:34:08] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: fix query condition [cookbooks] - 10https://gerrit.wikimedia.org/r/775955 [20:34:23] done [20:34:32] thanks [20:35:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:15] zabe: last one, going now! [20:36:17] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10RobH) [20:36:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 66 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:36:34] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10RobH) [20:36:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:36:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:54] (03PS4) 10Thcipriani: phpcs: enable rules that are already passing [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/775005 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:36:58] !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:774500|Stop writing to $wmfLocalServices (T45956)]] (duration: 00m 50s) [20:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:04] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:37:25] thcipriani, thx :) [20:37:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:38] zabe: should be live now, thanks for the patches [20:38:00] (03CR) 10Thcipriani: [C: 03+2] phpcs: enable rules that are already passing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775005 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:38:27] (03CR) 10Volans: [C: 03+2] "self-merging to fix the broken condition" [cookbooks] - 10https://gerrit.wikimedia.org/r/775955 (owner: 10Volans) [20:38:52] (03Merged) 10jenkins-bot: phpcs: enable rules that are already passing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775005 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:39:10] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10RobH) a:03Jclark-ctr [20:39:20] (03PS4) 10Thcipriani: phpcs: rename test files to match class names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775426 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:40:19] (03CR) 10Thcipriani: [C: 03+2] phpcs: rename test files to match class names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775426 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:40:52] !log reserving port 4017 for new k8s service request 
'image-suggestions' T304891 [20:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:58] T304891: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 [20:41:08] (03Merged) 10jenkins-bot: phpcs: rename test files to match class names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775426 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:41:42] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: fix query condition [cookbooks] - 10https://gerrit.wikimedia.org/r/775955 (owner: 10Volans) [20:41:49] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:03] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 64 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:42:32] (03PS4) 10Thcipriani: phpcs: enable and fix PropertyDocumentation.MissingVar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775427 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:42:36] (03CR) 10Thcipriani: [C: 03+2] phpcs: enable and fix PropertyDocumentation.MissingVar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775427 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:42:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:27] (03Merged) 10jenkins-bot: phpcs: enable and fix PropertyDocumentation.MissingVar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775427 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:43:30] 10SRE, 10Generated Data Platform, 10serviceops, 
10Service-deployment-requests: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10Dzahn) port reserved: 4017 https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports [20:43:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:46] (03CR) 10Andrew Bogott: [C: 03+2] nova api: Use a diff rather than a replacement file for regex hack [puppet] - 10https://gerrit.wikimedia.org/r/775953 (https://phabricator.wikimedia.org/T207538) (owner: 10Andrew Bogott) [20:45:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:46:15] !log thcipriani@deploy1002 Synchronized phpcs.xml: Config (noop): [[gerrit:775427|phpcs: enable and fix PropertyDocumentation.MissingVar (T171115)]] [[gerrit:775426|phpcs: rename test files to match class names (T171115)]] [[gerrit:775005|phpcs: enable rules that are already passing (T171115)]] (duration: 00m 49s) [20:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:20] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [20:47:16] (03PS1) 10Andrew Bogott: nova-api/wallaby: remove redundant package requirement [puppet] - 10https://gerrit.wikimedia.org/r/775956 [20:47:41] !log 
thcipriani@deploy1002 Synchronized src/StaticSiteConfiguration.php: Config (noop -- comment change): [[gerrit:775427|phpcs: enable and fix PropertyDocumentation.MissingVar (T171115)]] (duration: 00m 50s) [20:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:57] (03CR) 10Andrew Bogott: [C: 03+2] nova-api/wallaby: remove redundant package requirement [puppet] - 10https://gerrit.wikimedia.org/r/775956 (owner: 10Andrew Bogott) [20:49:04] !log thcipriani@deploy1002 Synchronized tests: Config (noop -- tests) (duration: 00m 50s) [20:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:50:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:09] (03Merged) 10jenkins-bot: Newcomer tasks: always align button and text to the right [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775371 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:51:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:26] 10SRE, 10Generated Data Platform, 10serviceops, 10Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10Dzahn) found out the "add dummy tokens to labs/private" step is not needed anymore [20:52:29] sergi0: I think we merged! 
[20:53:07] thcipriani: yes, hurray! [20:53:39] sergi0: your change is live on mwdebug1002, can you check there to see if everything looks ok? [20:53:46] (03PS1) 10Andrew Bogott: openstack::nova::api::service::wallaby: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/775958 [20:53:56] yes, let me take a look [20:53:57] (03Abandoned) 10Andrew Bogott: OpenStack nova: Fix the regex hack that validates new VM names [puppet] - 10https://gerrit.wikimedia.org/r/775949 (https://phabricator.wikimedia.org/T304694) (owner: 10Andrew Bogott) [20:54:27] we're looking good :) [20:54:48] cool, thanks for checking syncing now! [20:55:09] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::api::service::wallaby: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/775958 (owner: 10Andrew Bogott) [20:55:53] thcipriani: ty, I'll look again when it's on the cluster [20:56:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:56:39] !log thcipriani@deploy1002 Synchronized php-1.39.0-wmf.5/extensions/GrowthExperiments/modules/ext.growthExperiments.Homepage.SuggestedEdits/MatchModeSelectWidget.less: Backport: [[gerrit:775371|Newcomer tasks: always align button and text to the right (T301825)]] (duration: 00m 50s) [20:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:42] sergi0: and it's live now ^ \o/ [20:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:50] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [20:56:50] T301825: Account creation: add toggle to enable AND selection of interest topics - https://phabricator.wikimedia.org/T301825 [20:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:02] thcipriani thanks so much [20:57:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:57:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
[20:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:25] thcipriani: all looking good! [20:58:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:27] for context the reason we backport this cosmetic issue is we conduct an experiment in an offline event tomorrow at 20:00 UTC in Mexico as part of a campaign. Thank you very much thcipriani! [20:59:50] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [20:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:45] 10SRE, 10ops-eqsin, 10Traffic: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10BBlack) 05Open→03Resolved This seems to have resolved itself. There's no current SMART error, and all disks seem present and working at a glance. We can revisit if i... [21:01:04] (03CR) 10Nskaggs: [C: 03+1] "This triggered again just now. Thanks for cleaning up!" [puppet] - 10https://gerrit.wikimedia.org/r/774854 (https://phabricator.wikimedia.org/T304916) (owner: 10David Caro) [21:03:49] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [21:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:25] sergi0: yw! 
good luck with your experiment :) [21:06:34] !log utc late backport complete [21:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:34] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:07:35] PROBLEM - Host wdqs2002 is DOWN: PING CRITICAL - Packet loss = 100% [21:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:43] RECOVERY - Host wdqs2002 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [21:07:54] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp5012.eqsin.wmnet [21:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Access for new Data Platform Dev: Thomas Chin - https://phabricator.wikimedia.org/T305193 (10Ottomata) Approved! [21:08:24] (03PS1) 10Nskaggs: wmcs.backy2: fix typo in link to runbook for backup_vms [puppet] - 10https://gerrit.wikimedia.org/r/775961 (https://phabricator.wikimedia.org/T304408) [21:09:20] 10SRE, 10Infrastructure-Foundations: puppetmaster1001 disk warning on / - https://phabricator.wikimedia.org/T304898 (10Dzahn) down from 85% used to 81% used after a `apt-get clean` on puppetmaster1001 [21:10:00] (03PS2) 10Catrope: Remove unused Flow config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775876 (owner: 10Esanders) [21:10:19] (03CR) 10Catrope: [C: 03+2] Remove unused Flow config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775876 (owner: 10Esanders) [21:11:10] (03Merged) 10jenkins-bot: Remove unused Flow config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775876 (owner: 10Esanders) [21:13:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:13:43] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 
208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:13:47] !log catrope@deploy1002 Synchronized wmf-config/CommonSettings.php: [[gerrit:775876|Remove unused Flow config]] (duration: 00m 49s) [21:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:49] thcipriani: thanks! [21:17:21] !log bblack@cumin1001 conftool action : select; selector: name="^(cp1075|cp1079|cp2035|cp3050|cp3051|cp3052|cp3054|cp4022|cp5013|cp5014|cp5015).*" [21:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:30] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=^(cp1075|cp1079|cp2035|cp3050|cp3051|cp3052|cp3054|cp4022|cp5013|cp5014|cp5015).* [21:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:35] there we go [21:19:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:19:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:53] (03PS1) 10Dzahn: add a namespace for new service image-suggestions [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) [21:20:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:35] (03PS1) 10Ebernhardson: cirrus: Migrate popularity_score configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/775965 [21:23:01] (03CR) 10jerkins-bot: [V: 04-1] cirrus: Migrate popularity_score configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775965 (owner: 10Ebernhardson) [21:28:39] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: fix prometheus query [cookbooks] - 10https://gerrit.wikimedia.org/r/775968 [21:29:29] (03CR) 10BBlack: [C: 03+1] "Seems sensible!" [cookbooks] - 10https://gerrit.wikimedia.org/r/775968 (owner: 10Volans) [21:31:49] (Device rebooted) firing: Alert for device dellasw1.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [21:32:58] (03CR) 10Volans: [C: 03+2] sre.cdn.roll-restart-varnish: fix prometheus query [cookbooks] - 10https://gerrit.wikimedia.org/r/775968 (owner: 10Volans) [21:36:49] (Device rebooted) resolved: Device dellasw1.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [21:37:03] (03PS1) 10Ryan Kemper: wdqs: relax depool sleep by 1min [cookbooks] - 10https://gerrit.wikimedia.org/r/775969 [21:37:05] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: fix prometheus query [cookbooks] - 10https://gerrit.wikimedia.org/r/775968 (owner: 10Volans) [21:37:48] (03CR) 10Bking: [C: 03+1] wdqs: relax depool sleep by 1min [cookbooks] - 10https://gerrit.wikimedia.org/r/775969 (owner: 10Ryan Kemper) [21:37:53] (03PS2) 10Ryan Kemper: wdqs: relax depool sleep by 1min [cookbooks] - 10https://gerrit.wikimedia.org/r/775969 [21:40:17] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [21:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:41] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: relax depool sleep by 1min [cookbooks] - 10https://gerrit.wikimedia.org/r/775969 (owner: 10Ryan Kemper) [21:42:51] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[21:43:40] (03PS1) 10Volans: cookbooks.sre: SREBatchRunnerBase log query [cookbooks] - 10https://gerrit.wikimedia.org/r/775970 [21:43:42] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: optimize query [cookbooks] - 10https://gerrit.wikimedia.org/r/775971 [21:44:10] (03Merged) 10jenkins-bot: wdqs: relax depool sleep by 1min [cookbooks] - 10https://gerrit.wikimedia.org/r/775969 (owner: 10Ryan Kemper) [21:44:49] (03PS2) 10Volans: cookbooks.sre: SREBatchRunnerBase log query [cookbooks] - 10https://gerrit.wikimedia.org/r/775970 [21:45:30] (03CR) 10Volans: [C: 03+2] "trivial logging improvement, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/775970 (owner: 10Volans) [21:47:01] (03PS4) 10BBlack: geodns: remove geo-maps-esams-offline hack [dns] - 10https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) [21:47:03] (03PS5) 10BBlack: geodns: add drmrs fallback for esams to whole map [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) [21:48:04] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:31] (03PS1) 10Muehlenhoff: Switch idp-test to idp-test1001 [dns] - 10https://gerrit.wikimedia.org/r/775976 [21:50:11] (03Merged) 10jenkins-bot: cookbooks.sre: SREBatchRunnerBase log query [cookbooks] - 10https://gerrit.wikimedia.org/r/775970 (owner: 10Volans) [21:50:42] (03PS2) 10Volans: sre.cdn.roll-restart-varnish: optimize query [cookbooks] - 10https://gerrit.wikimedia.org/r/775971 [21:51:11] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [21:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:25] (03CR) 10Muehlenhoff: [C: 03+2] Switch idp-test to idp-test1001 [dns] - 10https://gerrit.wikimedia.org/r/775976 (owner: 10Muehlenhoff) [21:51:57] (03CR) 10Volans: [C: 03+2] "trivial, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/775971 
(owner: 10Volans) [21:55:05] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: optimize query [cookbooks] - 10https://gerrit.wikimedia.org/r/775971 (owner: 10Volans) [22:02:48] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [22:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:37] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [22:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:15:12] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775939 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [22:15:18] (03PS1) 10Muehlenhoff: Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/775990 [22:16:51] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [22:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:27] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/775990 (owner: 10Muehlenhoff) [22:18:34] (03CR) 10Dzahn: "Let's add a second, dedicated disk for the backups. This can also be merged to prevent disk running full soon.. 
but we should revert it af" [puppet] - 10https://gerrit.wikimedia.org/r/775265 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [22:21:33] (03PS1) 10Andrew Bogott: wmcs-cinder-volume-backup: use created_at rather than modified_at for purging [puppet] - 10https://gerrit.wikimedia.org/r/775997 [22:24:23] !log ganeti - creating new 100G virtual disk on gitlab2001 T274463 [22:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:29] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [22:27:15] (03CR) 10Andrew Bogott: "dcaro: fyi since if this goes horribly wrong you'll most likely catch the alert" [puppet] - 10https://gerrit.wikimedia.org/r/775997 (owner: 10Andrew Bogott) [22:28:49] !log ganeti - creating new 100G virtual disk on gitlab1001 T274463 [22:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:01] (03CR) 10BBlack: [C: 03+2] geodns: remove geo-maps-esams-offline hack [dns] - 10https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [22:29:22] (03PS5) 10BBlack: geodns: remove geo-maps-esams-offline hack [dns] - 10https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) [22:30:31] (03PS6) 10BBlack: geodns: add drmrs fallback for esams to whole map [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) [22:31:06] (03CR) 10BBlack: [C: 03+2] geodns: add drmrs fallback for esams to whole map [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [22:34:29] !log updated CAS to 6.4.6.2 [22:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:07] (03PS1) 10BBlack: Remove one last esams-offline note [dns] - 10https://gerrit.wikimedia.org/r/776003 (https://phabricator.wikimedia.org/T304089) [22:37:09] (03PS1) 10BBlack: Depool esams to test drmrs at full EMEA load [dns] - 
10https://gerrit.wikimedia.org/r/776004 (https://phabricator.wikimedia.org/T304089) [22:37:57] (03CR) 10BBlack: [C: 03+2] Remove one last esams-offline note [dns] - 10https://gerrit.wikimedia.org/r/776003 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [22:40:53] (03PS1) 10Zabe: Remove esams-offline note [dns] - 10https://gerrit.wikimedia.org/r/776005 [22:42:20] (03CR) 10BBlack: "Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/776005 (owner: 10Zabe) [22:42:31] (03CR) 10BBlack: [C: 03+2] Remove esams-offline note [dns] - 10https://gerrit.wikimedia.org/r/776005 (owner: 10Zabe) [22:43:29] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:44:32] (03PS2) 10BBlack: Depool esams to test drmrs at full EMEA load [dns] - 10https://gerrit.wikimedia.org/r/776004 (https://phabricator.wikimedia.org/T304089) [22:46:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:00:15] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:28] (03CR) 10BBlack: [C: 03+2] Depool esams to test drmrs at full EMEA load [dns] - 10https://gerrit.wikimedia.org/r/776004 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [23:01:11] !log esams->drmrs failover test begins - T304089 [23:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:17] T304089: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 [23:03:03] (03PS1) 10Zabe: Remove esams-offline note from README [dns] - 10https://gerrit.wikimedia.org/r/776009 (https://phabricator.wikimedia.org/T304089) [23:06:23] (03CR) 10BBlack: [C: 
03+2] "Thanks again!" [dns] - 10https://gerrit.wikimedia.org/r/776009 (https://phabricator.wikimedia.org/T304089) (owner: 10Zabe) [23:07:41] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 54.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:08:25] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:15:28] (03PS1) 10Muehlenhoff: Update to 6.4.6.2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/776010 [23:21:48] !log gitlab2001 (gitlab-replica.wikimedia.org) - rebooting to add new virtual disk T274463 [23:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:56] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [23:25:53] PROBLEM - Host gitlab2001 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:53] RECOVERY - Host gitlab2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [23:27:31] !log gitlab2001 - rebooted on ganeti level (needed when adding new virtual hardware), then ran into the usual bug T272555 where you have to manually fix the interface in /etc/network/interfaces T274463 [23:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:37] T272555: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 [23:27:38] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [23:29:31] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:22] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable 
- https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:31:45] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:51] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:45:02] !log gitlab2001 - fdisk /dev/vdb (g, w) (create partition table), (n, w) (create partition) ; mkfs.ext4 /dev/vdb1 (create filesystem); systemctl reset-failed (fix Icinga alert); mkdir /mnt/gitlab-backup; mount /dev/vdb1 /mnt/gitlab-backup ; blkid (get UUID); edit /etc/fstab and insert "UUID=c5235682-ac21-46a9-85ee-9603f694a6a4 /mnt/gitlab-backup ext4 errors=remount-ro 0 2" T274463 [23:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:09] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [23:52:01] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:10] (03CR) 10Dzahn: "gitlab2001 now has a new disk and it's mounted on /mnt/gitlab-backups (as opposed to /srv/gitlab-backups)." [puppet] - 10https://gerrit.wikimedia.org/r/775265 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
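Note on the 23:45:02 log entry: the last step there adds a line to /etc/fstab by hand. The following is a minimal sketch, not something that was run in the channel, of how such a line could be sanity-checked before it is written. The `FstabEntry` class and `parse_fstab_line` helper are hypothetical names introduced only for illustration. The sketch assumes all six whitespace-separated fstab fields are present and that the mount point is an absolute path, which is stricter than fstab itself (swap entries, for example, would be rejected).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FstabEntry:
    """One /etc/fstab line, split into its six standard fields (hypothetical helper)."""
    spec: str        # device identifier, e.g. "UUID=..."
    mountpoint: str  # where the filesystem is mounted
    fstype: str      # filesystem type, e.g. "ext4"
    options: str     # comma-separated mount options
    dump: int        # dump(8) frequency field
    passno: int      # fsck order at boot

    def render(self) -> str:
        """Return the entry as a single space-separated fstab line."""
        return (f"{self.spec} {self.mountpoint} {self.fstype} "
                f"{self.options} {self.dump} {self.passno}")


def parse_fstab_line(line: str) -> FstabEntry:
    """Parse one fstab line; assumes all six fields are given explicitly."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    spec, mountpoint, fstype, options, dump, passno = fields
    if not mountpoint.startswith("/"):
        raise ValueError("mount point must be an absolute path")
    return FstabEntry(spec, mountpoint, fstype, options, int(dump), int(passno))


# The line quoted in the log entry above.
LOGGED_LINE = ("UUID=c5235682-ac21-46a9-85ee-9603f694a6a4 "
               "/mnt/gitlab-backup ext4 errors=remount-ro 0 2")
entry = parse_fstab_line(LOGGED_LINE)
```

Parsing the line and rendering it back gives a quick round-trip check that nothing was dropped or mistyped; it does not verify that the UUID matches what `blkid` reports on the host.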