[00:01:15] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [00:01:47] ^ 👀 [00:02:43] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:03:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24073 and previous config saved to /var/cache/conftool/dbconfig/20220405-000355-ladsgroup.json [00:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1083.eqiad.wmnet [00:07:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3062.esams.wmnet [00:08:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5012.eqsin.wmnet [00:10:07] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [00:11:17] ^ I wasn't able to find out what that was, but it seems to be over now [00:11:53] POST latency only, and appservers only (not API) without a smoking gun from any particular backend!
very weird [00:12:05] I'll poke around a little more but then leave it alone, unless it recurs [00:16:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1083.eqiad.wmnet [00:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:37] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3062.esams.wmnet [00:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:18:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5012.eqsin.wmnet [00:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24074 and previous config saved to /var/cache/conftool/dbconfig/20220405-001900-ladsgroup.json [00:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:02] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp104[6-8].eqiad.wmnet [00:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:25] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 3210 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [00:22:55] (LogstashKafkaConsumerLag) resolved: (2) Too many 
messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:23:28] !log wtp1046, wtp1047, wtp1048 - rebooting, one at a time [00:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:10] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (Dzahn) >>! In T305155#7823133, @Dzahn wrote: > port reserved: 4017 > > https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports... [00:27:18] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1048.eqiad.wmnet [00:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:21] ACKNOWLEDGEMENT - Host wtp1047 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot [00:30:13] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 121816 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [00:30:41] !log gitlab.wikimedia.org was down because gitlab1001 ran out of disk space. ran 'apt-get clean' to free 13G which made it recover... [00:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:44] hrmmm [00:31:31] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:32:47] !log gitlab.wikimedia.org was down because gitlab1001 ran out of disk space. ran 'apt-get clean' to free 13G which made it recover...
T274463 - <+icinga-wm> RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK [00:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:50] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [00:33:14] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1047.eqiad.wmnet [00:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:20] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1046.eqiad.wmnet [00:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:31] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4032.ulsfo.wmnet [00:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24075 and previous config saved to /var/cache/conftool/dbconfig/20220405-003405-ladsgroup.json [00:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:34:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [00:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [00:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24076 and previous config saved to /var/cache/conftool/dbconfig/20220405-003419-ladsgroup.json [00:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:01] !log gitlab2001 - apt-get clean to prevent disk space issues [00:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:44] !log gitlab1001 - mv 1648814678_2022_04_01_14.9.1_gitlab_backup.tar and other files from April 2nd/April 3rd over from /srv/gitlab-backup to /mnt/gitlab-backup to prevent another outage due to disk space T274463 [00:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:48] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [00:40:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4032.ulsfo.wmnet [00:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:42] SRE, Infrastructure-Foundations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Effeietsanders) I suspect this ticket can now be resolved. Haven't seen recent activity.
[00:42:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2042.codfw.wmnet [00:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1084.eqiad.wmnet [00:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5016.eqsin.wmnet [00:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2042.codfw.wmnet [00:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4034.ulsfo.wmnet [00:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1084.eqiad.wmnet [00:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3063.esams.wmnet [00:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5016.eqsin.wmnet [00:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4034.ulsfo.wmnet [00:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0100) [01:02:59] !log 
sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3063.esams.wmnet [01:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5002.eqsin.wmnet [01:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3053.esams.wmnet [01:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3053.esams.wmnet [01:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:41] (PS1) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [01:23:15] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [01:27:01] PROBLEM - Host cp5002 is DOWN: PING CRITICAL - Packet loss = 100% [01:32:37] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:33:21] yeah, the cookbook for cp5002 is stuck at [77/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for cp5002.eqsin.wmnet [01:33:43] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24077 and previous config saved to
/var/cache/conftool/dbconfig/20220405-013609-ladsgroup.json [01:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:41:15] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:30] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:56] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [01:47:25] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cp5002.eqsin.wmnet [01:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24078 and previous config saved to /var/cache/conftool/dbconfig/20220405-015114-ladsgroup.json [01:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:30] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi I updated the table with transit/transport links. Please double check. For cr1 to cr2 I have a total of 3 links: 2 on FPC3 and 1 on FPC4.
My guess is the link on FPC4 is there in case FPC3... [01:59:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423 [01:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423 [01:59:29] T305423: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 [01:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24079 and previous config saved to /var/cache/conftool/dbconfig/20220405-020619-ladsgroup.json [02:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:33] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.6 [core] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777041 [02:07:39] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.6 [core] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777041 (owner: TrainBranchBot) [02:07:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:08:49]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24080 and previous config saved to /var/cache/conftool/dbconfig/20220405-022124-ladsgroup.json [02:21:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [02:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [02:21:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24081 and previous config saved to /var/cache/conftool/dbconfig/20220405-022132-ladsgroup.json [02:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:34] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.6 [core] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777041 (owner: TrainBranchBot) [02:29:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:29:49] !log mwdebug-deploy@deploy1002 helmfile
[codfw] START helmfile.d/services/mwdebug: apply [02:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [03:17:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24082 and previous config saved to /var/cache/conftool/dbconfig/20220405-031745-ladsgroup.json [03:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:32:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24083 and previous config saved to /var/cache/conftool/dbconfig/20220405-033251-ladsgroup.json [03:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24084 and previous config saved to /var/cache/conftool/dbconfig/20220405-034756-ladsgroup.json [03:47:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:57] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:02:43] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:03:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24085 and previous config saved to /var/cache/conftool/dbconfig/20220405-040301-ladsgroup.json [04:03:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [04:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [04:03:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24086 and previous config saved to /var/cache/conftool/dbconfig/20220405-040309-ladsgroup.json [04:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:41] RECOVERY - Check unit status of 
netbox_ganeti_esams_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:34:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 for testing T301879', diff saved to https://phabricator.wikimedia.org/P24087 and previous config saved to /var/cache/conftool/dbconfig/20220405-043426-marostegui.json [04:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:31] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [05:00:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24088 and previous config saved to /var/cache/conftool/dbconfig/20220405-050047-ladsgroup.json [05:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:12:07] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:15:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24089 and previous config saved to /var/cache/conftool/dbconfig/20220405-051552-ladsgroup.json [05:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:32] <_joe_> !log uploading new minor version of conftool to apt for buster/bullseye (requestctl new feature) [05:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24090 and previous config saved to /var/cache/conftool/dbconfig/20220405-053057-ladsgroup.json [05:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:03] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:42:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:46:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24091 and previous config saved to /var/cache/conftool/dbconfig/20220405-054602-ladsgroup.json [05:46:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:46:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24092 and previous config 
saved to /var/cache/conftool/dbconfig/20220405-054610-ladsgroup.json [05:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 for testing T301879', diff saved to https://phabricator.wikimedia.org/P24093 and previous config saved to /var/cache/conftool/dbconfig/20220405-055256-marostegui.json [05:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:00] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [06:00:05] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0600). [06:01:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 into API for testing T301879', diff saved to https://phabricator.wikimedia.org/P24094 and previous config saved to /var/cache/conftool/dbconfig/20220405-060124-marostegui.json [06:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:30] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [06:06:01] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:08:07] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:21:05] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: Herron) [06:36:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1132 T301879', diff saved to https://phabricator.wikimedia.org/P24095 and previous config saved to /var/cache/conftool/dbconfig/20220405-063648-marostegui.json [06:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:52] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [06:37:13] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [06:50:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24096 and previous config saved to /var/cache/conftool/dbconfig/20220405-065053-ladsgroup.json [06:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:58:20] SRE, Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (NFSL2001) @MoritzMuehlenhoff Any updates for this?
The preview character is still missing and problematic for Chinese users. Here is a screenshot of how it should look for last 2 elements: {F... [07:00:05] Amir1, awight, Urbanecm, and taavi: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:22] o/ looks like nothing to do [07:04:46] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24097 and previous config saved to /var/cache/conftool/dbconfig/20220405-070558-ladsgroup.json [07:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:20] (PS1) Ayounsi: sflow: fix pre_tag2_filter [puppet] - https://gerrit.wikimedia.org/r/777292 (https://phabricator.wikimedia.org/T263277) [07:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24098 and previous config saved to /var/cache/conftool/dbconfig/20220405-072103-ladsgroup.json [07:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:52] (CR) Ayounsi: [C: +1] sre.SREBatchBase: additional customizations [cookbooks] - https://gerrit.wikimedia.org/r/776965 (owner: Volans) [07:23:27] (CR) Ayounsi: [C: +1] sre.cdn.roll-restart-varnish: improvements [cookbooks] - https://gerrit.wikimedia.org/r/776966 (owner: Volans) [07:28:00] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (ayounsi) Open→Resolved a:Andrew→ayounsi Fix merged.
[07:36:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24099 and previous config saved to /var/cache/conftool/dbconfig/20220405-073608-ladsgroup.json [07:36:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [07:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [07:36:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24100 and previous config saved to /var/cache/conftool/dbconfig/20220405-073617-ladsgroup.json [07:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:44] (03PS1) 10Jaime Nuche: testwikis wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777294 [07:37:46] (03CR) 10Jaime Nuche: [C: 03+2] testwikis wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777294 (owner: 10Jaime Nuche) [07:38:25] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777294 (owner: 10Jaime Nuche) [07:38:27] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.6 refs T305212 [07:38:29] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:30] T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212 [07:39:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ayounsi) See comment in T303424#7830897, they *might* be able to go in any 10G rack, private vlan. Regardless, those are prod hosts (public/pr... 
[07:51:13] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:52:58] !log disable BGP to Tata in drmrs for circuit move - T298208 [07:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:19] (03CR) 10JMeybohm: [C: 03+2] Update cert-manager to 1.5.4-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/776971 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [07:57:36] (03Merged) 10jenkins-bot: Update cert-manager to 1.5.4-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/776971 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [08:00:04] jnuche and hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0800). [08:02:43] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:12:00] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [08:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:49] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:13] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:59] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[08:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:17] (03CR) 10JMeybohm: [C: 03+1] Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [08:14:25] (03CR) 10JMeybohm: [C: 03+1] Change the Calico's pod IP subnet for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/776876 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [08:14:34] (03PS1) 10Daniel Kinzler: Add ~daniel/.profile [puppet] - 10https://gerrit.wikimedia.org/r/777298 [08:14:53] (03CR) 10Jcrespo: [C: 03+1] "fileset change seems good" [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [08:15:12] (03CR) 10JMeybohm: [C: 03+1] role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - 10https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [08:15:29] (03CR) 10JMeybohm: [C: 03+1] role::ml_k8s::master: change the codfw svc IP range [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [08:19:21] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet [08:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:21] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.6 refs T305212 (duration: 42m 53s) [08:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:24] T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212 [08:23:10] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye [08:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:13] (03PS1) 10Jaime Nuche: group0 wikis to 1.39.0-wmf.6 refs 
T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777299 [08:26:16] (03CR) 10Jaime Nuche: [C: 03+2] group0 wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777299 (owner: 10Jaime Nuche) [08:26:50] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dragonfly-supernode2001.codfw.wmnet [08:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:56] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777299 (owner: 10Jaime Nuche) [08:27:03] PROBLEM - Check systemd state on dragonfly-supernode2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:08] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw1001.eqiad.wmnet with OS bullseye [08:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:50] (03CR) 10Btullis: [C: 03+2] Define the DATHUB_SECRET value [deployment-charts] - 10https://gerrit.wikimedia.org/r/776954 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [08:29:12] (03CR) 10Btullis: [C: 03+2] Remove test hosts from the JVM heap memory alerts [alerts] - 10https://gerrit.wikimedia.org/r/776919 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [08:29:19] RECOVERY - Check systemd state on dragonfly-supernode2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:32] (03CR) 10Btullis: [C: 03+2] Remove the statsv source from the VarnishkafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/776912 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [08:31:23] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.6 refs T305212 [08:31:25] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:26] T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212 [08:31:36] (03Merged) 10jenkins-bot: Remove test hosts from the JVM heap memory alerts [alerts] - 10https://gerrit.wikimedia.org/r/776919 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [08:33:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:14] (03PS1) 10Volans: homer: adjust the daily diff start time [puppet] - 10https://gerrit.wikimedia.org/r/777300 [08:34:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24101 and previous config saved to /var/cache/conftool/dbconfig/20220405-083423-ladsgroup.json [08:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:35:02] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye [08:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply 
[08:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:01] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:55] (03PS3) 10Jcrespo: dbbackups: Setup a valid_sections.txt config for db backup checks [puppet] - 10https://gerrit.wikimedia.org/r/776969 (https://phabricator.wikimedia.org/T301315) [08:45:50] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Setup a valid_sections.txt config for db backup checks [puppet] - 10https://gerrit.wikimedia.org/r/776969 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [08:46:30] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [08:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:19] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [08:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24102 and previous config saved to /var/cache/conftool/dbconfig/20220405-084928-ladsgroup.json [08:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:16] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:21] (03PS1) 10JMeybohm: Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) [08:58:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - 
https://phabricator.wikimedia.org/T305414 (10Majavah) Cloudwebs need access to various OS APIs. Most of them are hosted in the production realm and should be accessible from any production... [09:04:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24103 and previous config saved to /var/cache/conftool/dbconfig/20220405-090434-ladsgroup.json [09:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:57] (03PS1) 10JMeybohm: Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) [09:05:59] (03PS1) 10JMeybohm: Move kubemaster2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) [09:07:18] (03PS3) 10Jcrespo: dbbackups: Monitor db_inventory rather than zarcillo section [puppet] - 10https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315) [09:07:20] (03PS1) 10Jcrespo: dbbackups: Update valid_sections.txt permissions to be world-readable [puppet] - 10https://gerrit.wikimedia.org/r/777312 (https://phabricator.wikimedia.org/T301315) [09:07:34] (03PS2) 10Jcrespo: dbbackups: Update valid_sections.txt permissions to be world-readable [puppet] - 10https://gerrit.wikimedia.org/r/777312 (https://phabricator.wikimedia.org/T301315) [09:07:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34691/console" [puppet] - 10https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:08:46] (03CR) 10Volans: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/777298 (owner: 10Daniel Kinzler) [09:08:52] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34692/console" [puppet] - 
10https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:08:54] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34693/console" [puppet] - 10https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:09:33] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 18.33 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:10:02] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update valid_sections.txt permissions to be world-readable [puppet] - 10https://gerrit.wikimedia.org/r/777312 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [09:11:10] (03PS1) 10Arturo Borrero Gonzalez: cloudgw1001: hieradata: refresh NIC names [puppet] - 10https://gerrit.wikimedia.org/r/777313 (https://phabricator.wikimedia.org/T304598) [09:11:15] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.6" [09:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:45] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:12:00] (03PS1) 10Btullis: Allow wikikube staging pod range to access kafka eqiad-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/777314 (https://phabricator.wikimedia.org/T303049) [09:12:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw1001: hieradata: refresh NIC names [puppet] - 10https://gerrit.wikimedia.org/r/777313 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [09:12:50] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:58] (03PS1) 10Jaime Nuche: Revert "group0 wikis to 1.39.0-wmf.6 refs T305212" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777315 [09:13:00] (03CR) 10Jaime Nuche: [C: 03+2] Revert "group0 wikis to 1.39.0-wmf.6 refs T305212" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777315 (owner: 10Jaime Nuche) [09:13:15] (03PS1) 10Majavah: dynamicproxy: remove support for x-novaproxy-edit-dns [puppet] - 10https://gerrit.wikimedia.org/r/777316 (https://phabricator.wikimedia.org/T295246) [09:13:22] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34694/console" [puppet] - 10https://gerrit.wikimedia.org/r/777314 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [09:13:44] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.6 refs T305212" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777315 (owner: 10Jaime Nuche) [09:14:41] (03PS1) 10MMandere: site: Reimage cp5015 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777317 (https://phabricator.wikimedia.org/T290005) [09:14:43] (03PS1) 10MMandere: site: Reimage cp6007 as cache::upload_haproxy [puppet] - 
10https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005) [09:14:45] (03PS1) 10MMandere: site: Reimage cp5013 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005) [09:14:47] (03PS1) 10MMandere: site: Reimage cp5007 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005) [09:14:49] (03PS1) 10MMandere: site: Reimage cp5001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777321 (https://phabricator.wikimedia.org/T290005) [09:14:51] (03PS1) 10MMandere: site: Reimage cp4035 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777322 (https://phabricator.wikimedia.org/T290005) [09:14:53] (03PS1) 10MMandere: site: Reimage cp3052 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777323 (https://phabricator.wikimedia.org/T290005) [09:14:55] (03PS1) 10MMandere: site: Reimage cp4027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777324 (https://phabricator.wikimedia.org/T290005) [09:14:57] (03PS1) 10MMandere: site: Reimage cp4033 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777325 (https://phabricator.wikimedia.org/T290005) [09:14:59] (03PS1) 10MMandere: site: Reimage cp4021 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777326 (https://phabricator.wikimedia.org/T290005) [09:15:01] (03PS1) 10JMeybohm: Move kubemaster1002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) [09:15:03] (03PS1) 10JMeybohm: Move kubemaster1001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) [09:17:05] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34695/console" [puppet] - 10https://gerrit.wikimedia.org/r/777327 
(https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:17:26] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34696/console" [puppet] - 10https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:19:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24104 and previous config saved to /var/cache/conftool/dbconfig/20220405-091939-ladsgroup.json [09:19:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:19:43] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:46] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24105 and previous config saved to /var/cache/conftool/dbconfig/20220405-091947-ladsgroup.json [09:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:20:21] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Monitor db_inventory rather than zarcillo section [puppet] - 10https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [09:20:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:58] (03PS1) 10Btullis: Allow kikikube staging pods to access the analytics-meta test instance [puppet] - 10https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) [09:21:19] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: 10Btullis) [09:21:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:57] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:45] (03PS2) 10Btullis: Allow kikikube staging pods to access the analytics-meta test instance [puppet] - 10https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) [09:29:17] (03CR) 10Elukey: [C: 03+1] Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:30:37] (03CR) 10Btullis: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: 10Btullis) [09:33:01] (03PS2) 10Btullis: Remove the statsv source from the VarnishkafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/776912 (https://phabricator.wikimedia.org/T300246) [09:33:12] (03CR) 10Elukey: [C: 03+1] Move kubemaster2002 to bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:33:20] (03CR) 10Btullis: [V: 03+2 C: 03+2] Remove the statsv source from the VarnishkafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/776912 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [09:33:28] (03CR) 10Elukey: [C: 03+1] Move kubemaster2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:33:51] (03CR) 10Elukey: [C: 03+1] Move kubemaster1002 to bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:34:05] (03CR) 10Kormat: dbtools: Port switchover-tmpl to python (031 comment) [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [09:34:21] (03CR) 10Elukey: [C: 03+1] Move kubemaster1001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [09:38:51] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34697/console" [puppet] - 10https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: 10Btullis) [09:39:20] (03CR) 10Btullis: [V: 03+1 C: 03+2] Allow kikikube staging pods to access the analytics-meta test instance [puppet] - 10https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: 10Btullis) [09:40:24] (03CR) 
10Btullis: Allow kikikube staging pods to access the analytics-meta test instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: 10Btullis) [09:42:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:43:06] (03PS1) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [09:43:48] (03CR) 10jerkins-bot: [V: 04-1] WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:46:30] shush jerkins [09:49:12] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:05] (03PS2) 10Elukey: role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - 10https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673) [09:52:07] (03PS2) 10Elukey: role::ml_k8s::master: change the codfw svc IP range [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) [09:52:09] (03PS1) 10Elukey: install_server: set Bullseye for ml-serve-ctrl* nodes [puppet] - 10https://gerrit.wikimedia.org/r/777332 (https://phabricator.wikimedia.org/T304673) [09:54:46] (03PS1) 10Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - 10https://gerrit.wikimedia.org/r/777333 [09:55:04] (03PS2) 10Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - 10https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) 
[09:55:10] (03CR) 10Muehlenhoff: [C: 03+2] Move Prometheus Apache setup to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/775296 (owner: 10Muehlenhoff) [09:58:57] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34698/" [puppet] - 10https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [10:00:42] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10Vgutierrez) [10:02:24] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp5015 as cache::text_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777317 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:02:43] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp6007 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:02:57] (03CR) 10Klausman: [C: 03+1] install_server: set Bullseye for ml-serve-ctrl* nodes [puppet] - 10https://gerrit.wikimedia.org/r/777332 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [10:03:22] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp5013 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:03:30] (03PS3) 10Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - 10https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) [10:03:47] (03CR) 10Ayounsi: [C: 03+1] homer: adjust the daily diff start time [puppet] - 10https://gerrit.wikimedia.org/r/777300 (owner: 10Volans) [10:03:50] (03CR) 10Elukey: [C: 03+2] install_server: set Bullseye for ml-serve-ctrl* nodes [puppet] - 10https://gerrit.wikimedia.org/r/777332 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [10:04:03] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp5007 as 
cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:04:43] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp5001 as cache::upload_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777321 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:05:29] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp4035 as cache::text_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777322 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:05:39] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected https://puppet-compiler.wmflabs.org/pcc-worker1002/34699/" [puppet] - 10https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [10:05:59] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp3052 as cache::text_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777323 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:06:50] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp4027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777324 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:06:56] (03PS4) 10Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - 10https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) [10:07:27] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp4033 as cache::upload_haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777325 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:07:57] (03CR) 10Vgutierrez: [C: 03+1] site: Reimage cp4021 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777326 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:10:02] (03CR) 10Volans: [C: 03+2] homer: adjust the daily diff start time [puppet] - 10https://gerrit.wikimedia.org/r/777300 (owner: 
10Volans) [10:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:13:12] (03CR) 10Btullis: "The change looks good, but I am confused by the statement:" [puppet] - 10https://gerrit.wikimedia.org/r/777292 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [10:14:11] (03CR) 10David Caro: [C: 03+1] "Got a question about how it gets applied, not really the patch itself." [puppet] - 10https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [10:17:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24107 and previous config saved to /var/cache/conftool/dbconfig/20220405-101709-ladsgroup.json [10:17:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw1001: use a custom name for the dataplane NIC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [10:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:18:06] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:49] (03PS1) 10Volans: prometheus: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/777334 [10:19:58] !log btullis@deploy1002 
helmfile [staging] START helmfile.d/services/datahub: apply on main [10:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:35] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: conntrackd: refresh NIC name [puppet] - 10https://gerrit.wikimedia.org/r/777335 (https://phabricator.wikimedia.org/T304598) [10:23:18] (03CR) 10Volans: [C: 03+2] prometheus: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/777334 (owner: 10Volans) [10:23:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: conntrackd: refresh NIC name [puppet] - 10https://gerrit.wikimedia.org/r/777335 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [10:25:37] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34700/netflow6001.drmrs.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/777292 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [10:26:21] (03Merged) 10jenkins-bot: prometheus: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/777334 (owner: 10Volans) [10:30:09] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudgw1001.eqiad.wmnet with OS bullseye [10:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:40] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye [10:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:48] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24108 and previous config saved to /var/cache/conftool/dbconfig/20220405-103214-ladsgroup.json [10:32:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [10:38:06] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:39:01] (03CR) 10Volans: [C: 03+2] sre.SREBatchBase: additional customizations [cookbooks] - 10https://gerrit.wikimedia.org/r/776965 (owner: 10Volans) [10:39:08] (03CR) 10Volans: [C: 03+2] sre.cdn.roll-restart-varnish: improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/776966 (owner: 10Volans) [10:42:11] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [10:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:21] (03Merged) 10jenkins-bot: sre.SREBatchBase: additional customizations [cookbooks] - 10https://gerrit.wikimedia.org/r/776965 (owner: 10Volans) [10:42:24] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/776966 (owner: 10Volans) [10:45:03] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [10:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24109 and previous config saved to /var/cache/conftool/dbconfig/20220405-104719-ladsgroup.json [10:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:47:44] (03CR) 10Arturo Borrero Gonzalez: "gerrit cannot rebase this one, please rebase by hand." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749711 (owner: 10Majavah) [10:47:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ayounsi) Noted! To keep track of the IRC conversation, echoing it here: > is that a hard blocker? or could it be fixed before those hosts are l... [10:55:12] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1001.eqiad.wmnet with OS bullseye [10:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:44] (03PS3) 10Jgiannelos: maps: Re-enable OSM sync for on eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/772453 (https://phabricator.wikimedia.org/T304984) [10:56:10] PROBLEM - SSH on wtp1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:56:21] !log installer spicerack v2.4.0 on the cumin hosts [10:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:37] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:03:37] PROBLEM - Host tools.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [11:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24110 and previous config saved to /var/cache/conftool/dbconfig/20220405-110224-ladsgroup.json [11:03:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [11:03:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [11:03:37] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24111 and previous config saved to /var/cache/conftool/dbconfig/20220405-110232-ladsgroup.json [11:03:37] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [11:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:17] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10MoritzMuehlenhoff) @Aitolkyn Can you please sign https://phabricator.wikimedia.org/L3 ? Then we're good to go. 
[11:05:26] (03CR) 10Btullis: [V: 03+1 C: 03+2] Allow wikikube staging pod range to access kafka eqiad-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/777314 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [11:06:11] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [11:06:11] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudgw1001.eqiad.wmnet [11:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:24] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [11:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:31] (03PS1) 10Muehlenhoff: Enable access for Paramita Das [puppet] - 10https://gerrit.wikimedia.org/r/777339 [11:10:10] (03CR) 10jerkins-bot: [V: 04-1] Enable access for Paramita Das [puppet] - 10https://gerrit.wikimedia.org/r/777339 (owner: 10Muehlenhoff) [11:10:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [11:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:17] (03Abandoned) 10MMandere: site: Reimage cp5002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776871 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:11:25] (03PS2) 10Muehlenhoff: Enable access for Paramita Das [puppet] - 10https://gerrit.wikimedia.org/r/777339 (https://phabricator.wikimedia.org/T305298) [11:11:27] (03PS1) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:12:05] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:12:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:49] (03Abandoned) 10MMandere: site: Reimage cp6007 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776872 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:13:13] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:22] (03CR) 10Muehlenhoff: [C: 03+2] Enable access for Paramita Das [puppet] - 10https://gerrit.wikimedia.org/r/777339 (https://phabricator.wikimedia.org/T305298) (owner: 10Muehlenhoff) [11:15:05] !log depool cp5015 for reimage - T290005 [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:08] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:15:11] (03PS2) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:16:36] (03CR) 10MMandere: [C: 03+2] site: Reimage cp5015 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777317 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:18:29] (03PS3) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:20:42] (03PS4) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:21:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34704/console" [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [11:23:55] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for 
host cp5015.eqsin.wmnet with OS buster [11:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:04] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5015.eqsin.wmnet with OS buster [11:25:32] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:53] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10Aitolkyn) >>! In T305299#7831501, @MoritzMuehlenhoff wrote: > @Aitolkyn Can you please sign https://phabricator.wikimedia.org/L3 ? Then we're good to go. @MoritzMu... [11:30:45] (03PS5) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:31:17] !log depool cp6007 for reimage - T290005 [11:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:20] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:32:29] (03PS6) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:32:52] (03PS2) 10MMandere: site: Reimage cp6007 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005) [11:33:34] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34705/console" [puppet] - 10https://gerrit.wikimedia.org/r/777341 
(https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [11:34:23] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6007 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:34:34] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @paramita_das: Your access has been enabled and you should have received an email with instructi... [11:36:17] (03PS7) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:37:34] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: move backups to /mnt/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [11:37:41] (03PS5) 10Jelto: gitlab: move backups to /mnt/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) [11:38:20] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10MoritzMuehlenhoff) And please note that your username is "paramd" (the UID you used when creating the account on wikitech). 
[11:38:29] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6007.drmrs.wmnet with OS buster [11:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:38] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster [11:38:43] (03PS1) 10Muehlenhoff: Enable access for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/777342 [11:39:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 T305427', diff saved to https://phabricator.wikimedia.org/P24112 and previous config saved to /var/cache/conftool/dbconfig/20220405-113944-root.json [11:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:48] T305427: Slow DB query on 10.6 - https://phabricator.wikimedia.org/T305427 [11:41:19] (03PS8) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [11:42:44] (03CR) 10Muehlenhoff: [C: 03+2] Enable access for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/777342 (owner: 10Muehlenhoff) [11:45:30] !log jnuche@deploy1002 Started scap: resync wmf.6 to reapply security patches - T305212 [11:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:33] T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212 [11:47:15] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: host reimage [11:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:20] !log jnuche@deploy1002 Finished scap: resync wmf.6 to reapply security patches - T305212 (duration: 02m 50s) [11:48:23] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:48] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @Aitolkyn Your access has been enabled and you should have received an email with instructions how... [11:49:40] (03PS1) 10Jaime Nuche: group0 wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777344 [11:49:42] (03CR) 10Jaime Nuche: [C: 03+2] group0 wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777344 (owner: 10Jaime Nuche) [11:50:24] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777344 (owner: 10Jaime Nuche) [11:50:26] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5015.eqsin.wmnet with reason: host reimage [11:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:51] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build/deplo code into a manager class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774459 [11:52:10] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.6 refs T305212 [11:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:13] T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212 [11:53:47] (03PS1) 10Jelto: gitlab: add backup and restore intervals to cloud hiera [puppet] - 10https://gerrit.wikimedia.org/r/777345 (https://phabricator.wikimedia.org/T274463) [11:53:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [11:54:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:52] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: k8s: factorize build/deplo code into a manager class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774459 (owner: 10Arturo Borrero Gonzalez) [11:56:08] RECOVERY - SSH on wtp1041.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:56:39] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage [11:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:46] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34706/console" [puppet] - 10https://gerrit.wikimedia.org/r/777345 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [11:58:10] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add backup and restore intervals to cloud hiera [puppet] - 10https://gerrit.wikimedia.org/r/777345 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:28:48] 10SRE, 10SRE Observability (FY2021/2022-Q4): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata) [12:32:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24115 and previous config saved to /var/cache/conftool/dbconfig/20220405-123227-ladsgroup.json [12:32:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:54] (03PS2) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [12:39:56] (03PS1) 10Filippo Giunchedi: WIP move core routers definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [12:40:34] !log pool cp5015 with HAProxy as TLS termination layer - T290005 [12:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:41:08] (03CR) 10jerkins-bot: [V: 04-1] WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:41:16] (03PS1) 10JMeybohm: Copy all helmfile-defaults to each subchart namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454) [12:42:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1085.eqiad.wmnet [12:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:32] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:43:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:56] (03PS3) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [12:46:03] !log pool cp6007 with HAProxy as TLS termination layer - T290005 [12:46:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:06] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:46:51] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) >>! In T300977#7821702, @Isaac wrote: > Chiming in as a heavy user of the stat boxes. It's difficult for me to fo... [12:47:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24116 and previous config saved to /var/cache/conftool/dbconfig/20220405-124732-ladsgroup.json [12:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [12:47:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:47:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [12:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24117 and previous config saved to /var/cache/conftool/dbconfig/20220405-124745-ladsgroup.json [12:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1085.eqiad.wmnet [12:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3064.esams.wmnet [12:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:39] (03CR) 10Btullis: [C: 03+1] "This looks excellent. Many thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454) (owner: 10JMeybohm) [12:58:38] (03CR) 10JMeybohm: [C: 03+2] Copy all helmfile-defaults to each subchart namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454) (owner: 10JMeybohm) [12:59:19] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet [12:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:22] 10SRE, 10Performance-Team: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [13:01:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet [13:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:26] (03Merged) 10jenkins-bot: Copy all helmfile-defaults to each subchart namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454) (owner: 10JMeybohm) [13:03:35] !log installing openssl updates from buster 10.12 point release [13:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3064.esams.wmnet [13:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:10] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:07:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4030.ulsfo.wmnet [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:24] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:11:18] I am a bit late for the window, but is there someone who can deploy two config patches for me? [13:11:50] zabe: hey, I'm around [13:12:25] looking [13:12:25] taavi, thx. I added them to the calendar. [13:13:00] (03PS3) 10Majavah: Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [13:13:03] (03CR) 10Majavah: [C: 03+2] Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [13:13:53] (03Merged) 10jenkins-bot: Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [13:14:37] zabe: pulled to mwdebug1001, is there anything to test?
[13:14:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4030.ulsfo.wmnet [13:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:25] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) [13:15:29] 10SRE, 10Infrastructure-Foundations: Advertised RSS/Atom feeds for wikimediastatus.net don't work - https://phabricator.wikimedia.org/T305174 (10CDanis) 05Open→03Resolved a:03CDanis [13:15:50] taavi, not really. nothing seems to break on testwiki, thats all I can really test [13:16:02] ok, I'll just sync then [13:16:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:02] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:776250|Pin CheckUser actor migration to old schema (T233004)]] (duration: 00m 54s) [13:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:04] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [13:17:19] (03CR) 10Majavah: [C: 03+2] Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776257 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:17:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:25] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host apt2001.wikimedia.org [13:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:30] (03Merged) 10jenkins-bot: Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776257 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:18:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:06] zabe: pulled the second too, I assume this too is not testable? [13:19:47] taavi, yep [13:19:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4003.wikimedia.org [13:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:59] ok, syncing [13:20:03] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) @BTullis > I realize that this suggestion increases the scope if the task considerably yup :) We unfortunately do... 
[13:20:55] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:776257|Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest (T45956)]] (duration: 00m 54s) [13:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:58] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:21:04] ok, done [13:21:18] (ProbeDown) firing: Service apt:80 has failed probes (http_apt_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:28] thanks [13:21:28] umh [13:21:34] I don't think that alert is related [13:21:36] uuhh, investigating [13:21:39] not related no [13:21:48] * Emperor here [13:21:59] * volans here [13:22:08] looks like caused by apt2001 restart? [13:22:11] might be the apt2001 reboot? but why is that suddenly alerting? [13:22:24] new paging probes :) [13:22:43] is it possible to know the full URL of the service being unreachable? [13:23:01] apt2001 uptime 2 mins [13:23:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4003.wikimedia.org [13:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:36] yes, apt2001 was intentionally rebooted along with downtime etc. 
and it's also the passive host [13:23:43] so that check should not alert at all [13:23:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:23:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2001.wikimedia.org [13:23:45] XioNoX: yes, it is in the logs [13:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:52] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kubestagemaster2001.codfw.wmnet with reason: reimage [13:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:55] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubestagemaster2001.codfw.wmnet with reason: reimage [13:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:12] godog: where? [13:24:15] moritzm: I'll go ACK the VO alert then [13:24:23] Emperor: ack, thx [13:25:14] XioNoX: there's a link at https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown but I'll be linking the logs from the grafana dashboard too [13:25:16] also apt.wikimedia.org being down has no user-visible impact at all (except that Puppet runs will get stalled/failed), there's no reason it should page to begin with [13:25:45] moritzm: ok to set page: false in service::catalog for it then I guess ? [13:26:18] (ProbeDown) resolved: Service apt:80 has failed probes (http_apt_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:27:35] godog: yeah, let's make a patch and ask for people's opinions? 
from my POV there's no need to make it page, but would like to have a second opinion at least [13:27:52] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:28:14] moritzm: ack, yeah I tend to agree there's not really a paging need [13:28:23] godog: looking at the logs dashboard linked from that wiki page, i don't see a way to spot the issue with apt [13:28:41] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [13:28:57] (03PS2) 10JMeybohm: Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) [13:29:27] kormat: I fished the IP out of the email from VO [13:30:13] Emperor: i don't think i got an email? [13:30:15] 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Triagers, 10acl*phabricator: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10lmata) [13:30:30] kormat: ah, VO pages me by email (at least initially) [13:30:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:46] 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Triagers, and 2 others: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10lmata) [13:31:31] kormat: that's fair yeah, the unfiltered logs dashboard is dense, the alert (in the UI) does have a link to a filtered view (i.e. 
with 'service.name'), I'll be working to make it more obvious when sth fails in the logs [13:31:35] godog: ack, I'll prepare a patch later [13:31:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1086.eqiad.wmnet [13:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:51] (03PS1) 10JMeybohm: Revert "Move kubestagemaster* to bullseye and upstream docker" [puppet] - 10https://gerrit.wikimedia.org/r/777016 [13:32:30] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:32:51] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 [13:32:56] 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Triagers, and 3 others: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup {{done}} [13:33:24] (03CR) 10JMeybohm: [C: 03+2] Revert "Move kubestagemaster* to bullseye and upstream docker" [puppet] - 10https://gerrit.wikimedia.org/r/777016 (owner: 10JMeybohm) [13:33:26] (03PS2) 10Volans: sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434) [13:34:01] well, at least the paging/probing works as expected [13:34:06] (03PS2) 10JMeybohm: Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) [13:35:48] (03CR) 10JMeybohm: [C: 03+2] Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm) [13:36:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:56] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:58] (03PS1) 10JMeybohm: Revert "Move kubemaster2002 to bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/777017 [13:36:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:37:40] 10SRE, 10Phabricator, 10SRE Observability (FY2021/2022-Q4), 10User-Ladsgroup: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10Majavah) [13:38:44] (03CR) 10JMeybohm: [C: 03+2] Revert "Move kubemaster2002 to bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/777017 (owner: 10JMeybohm) [13:39:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet [13:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host deneb.codfw.wmnet [13:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:00] (03PS1) 10JMeybohm: Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777018 [13:41:18] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Boldly doing so. Reopen if we get it again. 
[13:41:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet [13:41:33] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 6 hosts with reason: Cluster re-init for new IP ranges [13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:38] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: Cluster re-init for new IP ranges [13:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:06] (03CR) 10JMeybohm: [C: 03+2] Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777018 (owner: 10JMeybohm) [13:42:10] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3004.wikimedia.org [13:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:48] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:44:57] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp1086.eqiad.wmnet [13:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:16] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [13:45:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deneb.codfw.wmnet [13:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:47:08] PROBLEM - Check systemd state on cp1086 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens2f0np0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24119 and previous config saved to /var/cache/conftool/dbconfig/20220405-134801-ladsgroup.json [13:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:04] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:48:51] (03Merged) 10jenkins-bot: sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [13:49:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3004.wikimedia.org [13:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:57] (03PS3) 10Ladsgroup: dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) [13:50:18] (03CR) 10jerkins-bot: [V: 04-1] dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [13:52:47] (03CR) 10Ladsgroup: [C: 03+2] Add --dbgroupdefault=dump to every major dump run [dumps] - 10https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup) [13:53:15] (03Merged) 10jenkins-bot: Add --dbgroupdefault=dump to every major dump run [dumps] - 10https://gerrit.wikimedia.org/r/767477 
(https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup) [13:53:30] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:53:38] o/ [13:58:33] !log depool cp5013 for reimage - T290005 [13:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:36] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:59:58] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 129 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:00:28] jouncebot: nowandnext [14:00:28] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [14:00:28] In 1 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1600) [14:01:10] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:01:56] (03PS2) 10Ladsgroup: Enable videojs on all of DIP wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775294 (https://phabricator.wikimedia.org/T248418) [14:03:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24120 and previous config saved to /var/cache/conftool/dbconfig/20220405-140306-ladsgroup.json [14:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:20] (03CR) 10Ladsgroup: [C: 03+2] Enable videojs on all of DIP wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775294 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup) [14:03:48] (03PS2) 10MMandere: site: Reimage cp5013 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005) [14:04:10] (03Merged) 10jenkins-bot: Enable videojs on all of DIP wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775294 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup) [14:05:10] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:25] (03CR) 10MMandere: [C: 03+2] site: Reimage cp5013 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:05:28] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:775294|Enable videojs on all of DIP wikis (T248418)]] (duration: 00m 53s) [14:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:31] T248418: Roll out videojs as the only video/audio player on all 
Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [14:07:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:08:19] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5013.eqsin.wmnet with OS buster [14:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:28] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5013.eqsin.wmnet with OS buster [14:08:47] (03PS1) 10Muehlenhoff: Don't make apt.wikimedia.org page [puppet] - 10https://gerrit.wikimedia.org/r/777357 [14:09:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:12:05] !log depool cp5007 for reimage - T290005 [14:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:07] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:12:30] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/777357 (owner: 10Muehlenhoff) [14:13:12] (03PS4) 10Giuseppe Lavagetto: cache::base: add check to netmapper modification [puppet] - 10https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) [14:14:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpbb: Delete the git::clone and install via deb package [puppet] - 10https://gerrit.wikimedia.org/r/776977 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [14:15:02] (03CR) 10CDanis: [C: 03+1] cache::base: add check to netmapper modification [puppet] - 10https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [14:15:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:35] (03PS2) 10MMandere: site: Reimage cp5007 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005) [14:16:55] 
(LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [14:17:47] (03CR) 10MMandere: [C: 03+2] site: Reimage cp5007 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:18:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24121 and previous config saved to /var/cache/conftool/dbconfig/20220405-141811-ladsgroup.json [14:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:24] (03CR) 10Jcrespo: [C: 03+2] check: Read list of valid sections/valid backup jobs from a file [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [14:18:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:13] (03PS2) 10Muehlenhoff: Don't make apt.wikimedia.org page [puppet] - 10https://gerrit.wikimedia.org/r/777357 [14:19:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:21:26] (03PS2) 10Giuseppe Lavagetto: varnish::frontend: remove temporary rate-limits [puppet] - 10https://gerrit.wikimedia.org/r/773454 [14:21:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache::base: add check to netmapper modification [puppet] - 10https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [14:21:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/777357 (owner: 10Muehlenhoff) [14:21:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [14:21:58] PROBLEM - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: connect to address inference.discovery.wmnet and port 30443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:22:11] klausman: --^ [14:22:15] 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [14:22:26] 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) p:05Triage→03Medium [14:22:30] (JobUnavailable) firing: (7) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:46] (03PS1) 10JMeybohm: Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777021 [14:22:49] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5007.eqsin.wmnet with OS buster [14:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:55] the LVS alarm is fine, we have stopped the ml eqiad cluster [14:22:58] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5007.eqsin.wmnet with OS buster [14:24:00] elukey: I can't find that host in Icinga [14:24:38] that is... 
codfw [14:24:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:26:12] <_joe_> klausman: AFAICT, your alerting was actually reaching eqiad and not codfw [14:26:15] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:42] _joe_ it may be misleading, it says that inference.discovery.wmnet is not reachable [14:26:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:27:15] okok we are going to ack these [14:28:13] <_joe_> curl https://inference.svc.codfw.wmnet:30443/ returns 204, so I assume the alert is wrong, but we can check later [14:28:22] sure sure [14:28:35] * elukey takes notes [14:28:53] (03PS3) 10Giuseppe Lavagetto: varnish::frontend: remove temporary rate-limits [puppet] - 10https://gerrit.wikimedia.org/r/773454 [14:28:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:29:05] (03PS1) 10Jelto: gitlab: fix duplicate backup_dir hiera key [puppet] - 10https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463) [14:30:32] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2014.codfw.wmnet with reason: Rebooting for T303174 [14:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:34] !log 
kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc2014.codfw.wmnet with reason: Rebooting for T303174 [14:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:27] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: host reimage [14:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:39] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kubestagemaster1001.eqiad.wmnet with reason: reimage [14:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubestagemaster1001.eqiad.wmnet with reason: reimage [14:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish::frontend: remove temporary rate-limits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773454 (owner: 10Giuseppe Lavagetto) [14:31:59] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: host reimage [14:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:08] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34707/console" [puppet] - 10https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [14:33:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24122 and previous config saved to /var/cache/conftool/dbconfig/20220405-143316-ladsgroup.json [14:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:21] T298565: Fix mismatching field type of user table 
for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:33:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [14:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [14:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [14:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [14:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:54] ACKNOWLEDGEMENT - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: connect to address inference.discovery.wmnet and port 30443: No route to host Klausman Cluster re-init for new IP ranges (T304673) https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:33:54] ACKNOWLEDGEMENT - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is CRITICAL: connect to address inference.discovery.wmnet and port 30443: No route to host Klausman Cluster re-init for new IP ranges (T304673) https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:33:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:34:28] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - 10https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [14:34:36] (03PS3) 10Elukey: role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - 10https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673) [14:35:04] (03CR) 10Elukey: [C: 03+2] Change the Calico's pod IP subnet for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/776876 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [14:36:15] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2011.codfw.wmnet with reason: Rebooting for T303174 [14:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc2011.codfw.wmnet with reason: Rebooting for T303174 [14:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:20] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01286 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [14:38:04] (03CR) 10Ladsgroup: dbtools: Port 
switchover-tmpl to python (031 comment) [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [14:38:49] (03CR) 10Kormat: [C: 03+1] dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup) [14:38:54] moritzm: did you restarted any of the puppetmasters by any chance? ^^^ Widespread puppet agent failures [14:40:07] (03PS4) 10Ladsgroup: dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) [14:41:15] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Thanks Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/777357 (owner: 10Muehlenhoff) [14:42:28] RECOVERY - Check systemd state on cp1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:00] !log re-pool cp1086 [14:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:14] volans: kind of, I upgraded mod-auth-cas (which is installed on the Puppet master frontends for config-master.w.o) and that involves an Apache restart [14:44:44] should recover soonish [14:44:45] ack [14:44:47] thx [14:45:38] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-ctrl_6443: Servers ml-serve-ctrl1002.eqiad.wmnet are marked down but pooled: inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:45:54] (03PS2) 10Giuseppe Lavagetto: varnish::frontend: remove 
normalization for parameter [puppet] - 10https://gerrit.wikimedia.org/r/773455 [14:46:15] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:46:47] (03PS1) 10Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777363 [14:47:10] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5007.eqsin.wmnet with reason: host reimage [14:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:56] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:47:58] (KubernetesCalicoDown) firing: (6) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:48:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish::frontend: remove normalization for parameter [puppet] - 10https://gerrit.wikimedia.org/r/773455 (owner: 10Giuseppe Lavagetto) [14:48:18] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1005.mgmt.eqiad.wmnet with reboot policy FORCED [14:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2012.codfw.wmnet with reason: Rebooting for T303174 [14:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1:30:00 on pc2012.codfw.wmnet with reason: Rebooting for T303174 [14:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1006.mgmt.eqiad.wmnet with reboot policy FORCED [14:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:18] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 (owner: 10Volans) [14:49:34] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-cache1001.mgmt.eqiad.wmnet with reboot policy FORCED [14:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:56] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:50:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5007.eqsin.wmnet with reason: host reimage [14:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4036.ulsfo.wmnet [14:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:11] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:29] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-cache1002.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1087.eqiad.wmnet [14:50:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-cache1003.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3065.esams.wmnet [14:51:03] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-serve1008.mgmt.eqiad.wmnet with reboot policy FORCED [14:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) @elukey closed [14:55:22] (03PS4) 10Jelto: gitlab_runner: override ExecStart in service unit for non-root [puppet] - 10https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) [14:55:45] (03PS1) 10JMeybohm: Don't schedule calico kube-controllers on master nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/777364 [14:56:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2013.codfw.wmnet with reason: Rebooting for T303174 [14:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc2013.codfw.wmnet with reason: Rebooting for T303174 [14:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:29] (03CR) 10Jelto: [C: 03+2] gitlab_runner: override ExecStart in service unit for non-root [puppet] - 10https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:00:34] !log mmandere@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host cp5013.eqsin.wmnet with OS buster [15:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:43] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5013.eqsin.wmnet with OS buster com... [15:01:01] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Taking host offline to upgrade to Bullseye [15:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:03] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Taking host offline to upgrade to Bullseye [15:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:47] (03PS1) 10Btullis: Update the chart to address issues with secrets and CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454) [15:02:30] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:54] !log pool cp5013 with HAProxy as TLS termination layer - T290005 [15:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:01] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:03:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2020.codfw.wmnet with reason: Rebooting for T303174 [15:03:45] !log kormat@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:30:00 on es2020.codfw.wmnet with reason: Rebooting for T303174 [15:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:30] (03CR) 10Herron: [C: 03+2] logstash: set unit TimeoutStopSec of 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: 10Herron) [15:09:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:10:32] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1003.eqiad.wmnet with OS bullseye [15:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:02] PROBLEM - purged service on cp4036 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:11:15] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1087.eqiad.wmnet [15:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3065.esams.wmnet [15:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:11:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2022.codfw.wmnet with reason: Rebooting for T303174 [15:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2022.codfw.wmnet with reason: Rebooting for T303174 [15:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:00] !log installing atftp security updates [15:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:22] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:29] (03PS1) 10Alexandros Kosiaris: mobileapps: Increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/777369 (https://phabricator.wikimedia.org/T305482) [15:15:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bullseye [15:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:40] 👋 [15:15:41] ooof mmhh [15:15:46] * Emperor is here [15:15:57] <_joe_> looks like wdqs is down in eqiad? [15:16:24] indeed [15:16:27] <_joe_> godog: do we have the url that is probed somewhere? [15:16:56] wdqs-ssl:443 ; I think what that means is defined in puppet? 
[15:17:08] _joe_: yes, I am fixing the logs dashboard, in the meantime https://logstash.wikimedia.org/goto/f03a5816909ab0f402ec801c17aca444 [15:17:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/777369 (https://phabricator.wikimedia.org/T305482) (owner: 10Alexandros Kosiaris) [15:17:22] <_joe_> lvs sees everything as up [15:17:27] i.e. /readiness-probe [15:17:33] (03PS2) 10Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 [15:17:47] eqiad host is 10.2.2.32 [15:18:30] yeah looks like it was quite a short blip heh [15:19:06] clearly the alert is too trigger happy, my apologies for the page folks [15:19:14] <_joe_> yeah, also np [15:19:20] (03PS1) 10Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777370 [15:19:22] (03CR) 10Jaime Nuche: [C: 03+2] test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777370 (owner: 10Jaime Nuche) [15:19:26] <_joe_> that's why we tend to fine-tune alerts when they are introduced [15:19:30] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5007.eqsin.wmnet with OS buster [15:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:40] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5007.eqsin.wmnet with OS buster com... 
[15:19:50] indeed [15:20:13] would love if we could also get the rule or at least the DC into the alert message [15:20:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:20:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:20] er, could get the *URL [15:20:21] too early :) [15:20:25] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1003.eqiad.wmnet with reason: host reimage [15:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:34] (03PS1) 10Ladsgroup: ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993) [15:20:46] jouncebot: nowandnext [15:20:46] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [15:20:46] In 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1600) [15:20:50] noice [15:20:55] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2021.codfw.wmnet with reason: Rebooting for T303174 [15:20:56] ok sending a patch for a longer 'for' clause [15:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:57] !log kormat@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:30:00 on es2021.codfw.wmnet with reason: Rebooting for T303174 [15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:22] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002675 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:21:31] (03CR) 10Ladsgroup: [C: 03+2] ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993) (owner: 10Ladsgroup) [15:22:01] (03Merged) 10jenkins-bot: mobileapps: Increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/777369 (https://phabricator.wikimedia.org/T305482) (owner: 10Alexandros Kosiaris) [15:22:12] (03Abandoned) 10Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777363 (owner: 10Jaime Nuche) [15:22:15] <_joe_> godog: I am looking at the logs from lvs and indeed it seems the server that caused the page was depooled at 15:15:19 [15:22:28] <_joe_> pybal does the same query as your probe [15:23:02] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:03] <_joe_> so I would be surprised if we'd end up with paging if you had even one retry before we run out of backends that are healthy according to pybal [15:23:15] !log pool cp5007 with HAProxy as TLS termination layer - T290005 [15:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:21] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:23:23] <_joe_> else it means pybal [15:23:27] RECOVERY - purged service on cp4036 is OK: OK - purged is active 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:23:32] <_joe_> is severely overloaded and not checking enough [15:23:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1088.eqiad.wmnet [15:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:45] _joe_: yeah definitely 1m is too short, sending review for 2m now [15:25:01] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1003.eqiad.wmnet with reason: host reimage [15:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:04] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:07] or we can keep 1m but lower availability (i.e. failed probes) [15:25:25] <_joe_> godog: I would go with multiple failed probes before paging [15:25:27] <_joe_> as in [15:25:43] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:46] <_joe_> if we fail for 3 tries in a row [15:26:00] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:01] <_joe_> sadly if we say 2m but then have just one datapoint in that interval [15:26:26] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:39] in this case the probes are sent every 15s though, unlike icinga [15:26:51] from two prometheus hosts in codfw/eqiad [15:27:05] going with lower availability is valid too [15:27:21] git review isn't cooperating [15:27:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host cp4036.ulsfo.wmnet [15:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:06] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [15:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:07] (03PS1) 10Filippo Giunchedi: sre: page after 2m of < 75% avail for network probes [alerts] - 10https://gerrit.wikimedia.org/r/777371 [15:28:36] or more retries, i.e. 1m but say 60% avail [15:28:57] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10Ladsgroup) If we can get an explicit approval by legal to license all contributions of wikimedia.org email addresses to apache 2.0. I can start ma... [15:31:11] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1002.eqiad.wmnet with OS bullseye [15:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:20] (03PS2) 10Zabe: Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956) [15:31:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [15:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:37] ok going with https://gerrit.wikimedia.org/r/c/operations/alerts/+/777371 for now, unless there are objections [15:31:40] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bullseye [15:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1088.eqiad.wmnet [15:31:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:56] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bullseye [15:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1089.eqiad.wmnet [15:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:35:39] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10MoritzMuehlenhoff) >>! In T67270#7832416, @Ladsgroup wrote: > If we can get an explicit approval by legal to license all contributions of wikimedi... 
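[editor's note] The trade-off discussed above (a longer 'for' clause versus a lower availability threshold, with probes sent every 15s) can be sketched as a small simulation. This is an illustration only, not the rule from https://gerrit.wikimedia.org/r/777371: the function name `should_page`, the 60s averaging window, and the assumption that the rule is evaluated once per probe are all hypothetical. The 15s probe interval, the 75% threshold, and the 1m/2m durations are taken from the conversation.

```python
from typing import Sequence

PROBE_INTERVAL_S = 15  # from the discussion: probes are sent every 15s


def should_page(samples: Sequence[int], window_s: int,
                threshold: float, for_s: int) -> bool:
    """Hypothetical model of a 'page after for_s of < threshold avail' rule.

    samples:   probe results in time order, 1 = success, 0 = failure
    window_s:  averaging window for availability (assumed, not from the log)
    threshold: page only when availability is strictly below this fraction
    for_s:     how long the condition must hold continuously before paging
    """
    window_n = window_s // PROBE_INTERVAL_S
    consecutive = 0  # evaluations in a row with availability below threshold
    for end in range(window_n, len(samples) + 1):
        window = samples[end - window_n:end]
        avail = sum(window) / len(window)
        consecutive = consecutive + 1 if avail < threshold else 0
        # The first failing evaluation starts the clock at 0s elapsed.
        if consecutive and (consecutive - 1) * PROBE_INTERVAL_S >= for_s:
            return True
    return False


# One minute of failed probes (a short blip), then recovery.
blip = [1] * 20 + [0] * 4 + [1] * 20
# A sustained outage: five minutes of failed probes.
outage = [1] * 20 + [0] * 20

print(should_page(blip, window_s=60, threshold=0.75, for_s=60))     # True
print(should_page(blip, window_s=60, threshold=0.75, for_s=120))    # False
print(should_page(outage, window_s=60, threshold=0.75, for_s=120))  # True
```

Under these assumptions a one-minute blip pages with a 1m 'for' duration but not with 2m, while a sustained outage still pages. Lowering the threshold instead (for example 60% at 1m, as suggested above) is the other knob; it can be tried by changing `threshold` and `for_s` in the calls.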
[15:37:07] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: page after 2m of < 75% avail for network probes [alerts] - 10https://gerrit.wikimedia.org/r/777371 (owner: 10Filippo Giunchedi) [15:39:10] (03CR) 10jerkins-bot: [V: 04-1] ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993) (owner: 10Ladsgroup) [15:39:12] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1003.eqiad.wmnet with OS bullseye [15:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:54] (03PS1) 10Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777373 [15:39:56] (03CR) 10Jaime Nuche: [C: 03+2] test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777373 (owner: 10Jaime Nuche) [15:40:03] (03Merged) 10jenkins-bot: ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993) (owner: 10Ladsgroup) [15:40:41] (03Merged) 10jenkins-bot: test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777373 (owner: 10Jaime Nuche) [15:40:45] !log drain ganeti2019 T305469 [15:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:48] T305469: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 [15:41:31] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1089.eqiad.wmnet [15:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:04] !log ladsgroup@deploy1002 Synchronized 
php-1.39.0-wmf.5/includes: Backport: [[gerrit:777388|ParserOutputAccess: Allow calling getPO with option of not saving in PC (T285993)]] (duration: 01m 00s) [15:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:07] T285993: [SPIKE] Estimate growth in demand for Parser Cache storage - https://phabricator.wikimedia.org/T285993 [15:43:10] (03PS1) 10Herron: sre.kafka.reboot-workers: add logging-codfw targets [cookbooks] - 10https://gerrit.wikimedia.org/r/777375 (https://phabricator.wikimedia.org/T279342) [15:43:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1001.eqiad.wmnet with OS bullseye [15:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [15:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:51] ugh, there might be a bit of errors [15:44:00] that is me but should be recovered by now [15:44:11] Cannot access private const [15:44:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [15:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [15:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2024.codfw.wmnet with reason: Rebooting for T303174 [15:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:55] !log 
kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2024.codfw.wmnet with reason: Rebooting for T303174 [15:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:12] (03CR) 10Filippo Giunchedi: "LGTM, see inline for commit message adjustment" [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 (owner: 10Volans) [15:46:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [15:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:44] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [15:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:42] !log razzi@cumin1001 START - Cookbook sre.hosts.remove-downtime for dbstore1003.eqiad.wmnet [15:47:43] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbstore1003.eqiad.wmnet [15:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:48:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:21] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Upgrade dbstore1005 to bullseye [15:49:22] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:23] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Upgrade dbstore1005 to bullseye [15:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:24] (03PS3) 10Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 [15:49:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [15:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:40] (03CR) 10Volans: "addressed comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 (owner: 10Volans) [15:50:46] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:52:29] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1005.eqiad.wmnet with OS bullseye [15:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:25] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
Thank you" [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 (owner: 10Volans) [15:55:32] PROBLEM - Host ml-serve1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:15] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:57:30] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:58:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1002.eqiad.wmnet with OS bullseye [15:58:26] (03PS4) 10Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - 10https://gerrit.wikimedia.org/r/777353 [15:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:04] RECOVERY - Host ml-serve1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:59:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1004.eqiad.wmnet with OS bullseye [15:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:58] (KubernetesRsyslogDown) firing: (3) rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:15] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:01:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1090.eqiad.wmnet [16:01:16] SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis) New incidents and other posts on the status page will now automatically be... [16:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1003.eqiad.wmnet with OS bullseye [16:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:45] (CR) JMeybohm: [C: +1] Update the chart to address issues with secrets and CI [deployment-charts] - https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [16:02:02] (CR) Btullis: [C: +2] Update the chart to address issues with secrets and CI [deployment-charts] - https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [16:02:16] (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 [16:02:18] (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy) [16:02:20] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1005.eqiad.wmnet with reason: host reimage [16:02:21] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:53] (CR) jerkins-bot: [V: -1] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy) [16:03:46] (PS2) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 [16:03:48] (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy) [16:04:54] (Merged) jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy) [16:05:13] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1005.eqiad.wmnet with reason: host reimage [16:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:18] (CR) Volans: [C: +2] sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans) [16:06:27] (Merged) jenkins-bot: Update the chart to address issues with secrets and CI [deployment-charts] - https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [16:07:46] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:48] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:05] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:07] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:35] (Merged) jenkins-bot: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans) [16:09:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1090.eqiad.wmnet [16:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:15] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:47] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson) Moved all 3 servers to xe-0/0/28 on their respective switches, and committed the change on homer.
[16:17:50] (PS1) Majavah: O:openstack: add new encapi roles [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) [16:18:45] ops-codfw, DC-Ops, Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (RobH) [16:18:47] ops-codfw, DC-Ops, Infrastructure-Foundations: (Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (RobH) [16:18:56] ops-codfw, DC-Ops, Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (RobH) [16:19:03] ops-codfw, DC-Ops, Infrastructure-Foundations: (Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (RobH) [16:19:30] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1005.eqiad.wmnet with OS bullseye [16:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:08] (PS2) Majavah: O:openstack: add new encapi roles [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) [16:20:46] ops-codfw, DC-Ops, Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (RobH) [16:20:53] (PS1) Razzi: aqs: update mediawiki history snapshot for March 2022 [puppet] - https://gerrit.wikimedia.org/r/777407 [16:20:58] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34709/console" [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: Majavah) [16:21:09] (CR) Majavah: O:openstack: add new encapi roles [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: Majavah) [16:23:13] SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev
servers - https://phabricator.wikimedia.org/T305469 (Papaul) [16:27:14] SRE, ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (Cmjohnson) Open→Resolved The disk has been replaced and is back online cmjohnson@thanos-be1003:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 Adapter 0: Created VD 12 Configured physical... [16:27:26] (PS2) Krinkle: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [16:28:24] (CR) Krinkle: [C: +1] "LGTM. I updated the commit message as the current name wasn't a mistake. The directory is called wmf-config which seems fair to name as $w" [mediawiki-config] - https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [16:28:41] RECOVERY - MegaRAID on thanos-be1003 is OK: OK: optimal, 14 logical, 14 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:32:17] !log razzi@cumin1001 START - Cookbook sre.hosts.remove-downtime for dbstore1005.eqiad.wmnet [16:32:17] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbstore1005.eqiad.wmnet [16:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:54] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Upgrade dbstore1007 to bullseye [16:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:57] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Upgrade dbstore1007 to bullseye [16:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:47] !log ladsgroup@cumin1001 START - Cookbook
sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [16:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [16:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24123 and previous config saved to /var/cache/conftool/dbconfig/20220405-163454-ladsgroup.json [16:34:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:59] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[16:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:58] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1007.eqiad.wmnet with OS bullseye [16:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:36] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs.backy2: fix typo in link to runbook for backup_vms [puppet] - https://gerrit.wikimedia.org/r/775961 (https://phabricator.wikimedia.org/T304408) (owner: Nskaggs) [16:38:51] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [16:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:30] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:42:40] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [16:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:15] (JobUnavailable) resolved: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:48:13] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [16:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[16:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [16:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:04] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [16:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:36] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [16:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:43] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:54:09] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster [16:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:14] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster [16:58:06] (PS1) Ahmon Dancy: train-dev fixups [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777410 [16:58:08] (CR) Ahmon Dancy: [C: +2] train-dev fixups [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777410 (owner: Ahmon Dancy) [16:58:47] (Merged) jenkins-bot: train-dev fixups [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777410 (owner: Ahmon Dancy) [16:59:15] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [16:59:16] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:04] (PS1) MSantos: mobileapps: bump to 2022-04-04-120513-production [deployment-charts] - https://gerrit.wikimedia.org/r/777412 [17:02:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [17:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [17:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:35] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1046.eqiad.wmnet [17:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:37] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1007.eqiad.wmnet with OS bullseye [17:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:18] !log razzi@cumin1001 START - Cookbook sre.hosts.remove-downtime for dbstore1007.eqiad.wmnet [17:06:18] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbstore1007.eqiad.wmnet [17:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:01] (CR) RLazarus: [C: +2] httpbb: Delete the git::clone and install via deb package [puppet] - https://gerrit.wikimedia.org/r/776977 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus) [17:08:19] !log wtp1046 - rebooting [17:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:38] (PS1) Elukey: Update ml-serve-eqiad's dnscore pod IPs after cluster reinit [puppet] -
https://gerrit.wikimedia.org/r/777413 (https://phabricator.wikimedia.org/T304673) [17:09:37] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1045.eqiad.wmnet [17:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:35] (PS1) Elukey: Change ml-serve-eqiad coredns' pod IP after cluster reinit [deployment-charts] - https://gerrit.wikimedia.org/r/777414 (https://phabricator.wikimedia.org/T304673) [17:10:52] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:23] !log serially rebooting hosts in the wtp104* range [17:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:40] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1046.eqiad.wmnet [17:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:29] PROBLEM - Host wtp1045 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:37] (PS1) RLazarus: httpbb: Restore params to absented systemd::timer::job [puppet] - https://gerrit.wikimedia.org/r/777415 (https://phabricator.wikimedia.org/T299705) [17:14:44] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [17:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:03] RECOVERY - Host wtp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:16:07] (CR) RLazarus: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34711/console" [puppet] - https://gerrit.wikimedia.org/r/777415 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus) [17:16:40] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1044.eqiad.wmnet [17:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:58] (CR) RLazarus: [V: +1 C: +2] httpbb: Restore params to absented systemd::timer::job [puppet] - https://gerrit.wikimedia.org/r/777415 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus) [17:17:16] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1045.eqiad.wmnet [17:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:05] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:18:21] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1043.eqiad.wmnet [17:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:25] PROBLEM - Host wtp1043 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [17:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:47] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1044.eqiad.wmnet [17:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:27] RECOVERY - Host wtp1043 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:22:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4021.ulsfo.wmnet [17:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:35] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1146.eqiad.wmnet with OS buster [17:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:39] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host
an-worker1146.eqiad.wmnet with OS buster execut... [17:23:43] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1042.eqiad.wmnet [17:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:48] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1043.eqiad.wmnet [17:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6001.drmrs.wmnet [17:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:04] (PS1) Eigyan: [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD [mediawiki-config] - https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962) [17:25:26] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:05] PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:33] PROBLEM - Host wtp1042 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:17] ACKNOWLEDGEMENT - Host wtp1042 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot [17:27:41] RECOVERY - Host wtp1042 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [17:28:26] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1041.eqiad.wmnet [17:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:32] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1042.eqiad.wmnet [17:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:25] RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4021.ulsfo.wmnet [17:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4033.ulsfo.wmnet [17:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6001.drmrs.wmnet [17:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24124 and previous config saved to /var/cache/conftool/dbconfig/20220405-173143-ladsgroup.json [17:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:31:51] PROBLEM - Host wtp1041 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:17] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1040.eqiad.wmnet [17:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:49] RECOVERY - Host wtp1041 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:33:18] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1041.eqiad.wmnet [17:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:58] (PS1) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] -
https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) [17:34:15] (PS1) Btullis: Add the networkpoliy for the setups as a pre-install hook [deployment-charts] - https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) [17:34:46] (CR) jerkins-bot: [V: -1] cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez) [17:34:48] (PS2) Btullis: Add the networkpolicy for the setups as a pre-install hook [deployment-charts] - https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) [17:35:26] (PS2) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) [17:35:51] PROBLEM - Host wtp1040 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:01] (CR) jerkins-bot: [V: -1] cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez) [17:36:09] RECOVERY - Host wtp1040 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:36:40] (PS3) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) [17:36:59] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1040.eqiad.wmnet [17:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4033.ulsfo.wmnet [17:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:40] (PS4) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific
sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) [17:40:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4035.ulsfo.wmnet [17:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6002.drmrs.wmnet [17:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:00] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34712/" [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez) [17:41:44] (CR) Majavah: cloudgw: relocate dataplane-specific sysctl params to ifupdown (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez) [17:45:34] (CR) MSantos: [C: +2] mobileapps: bump to 2022-04-04-120513-production [deployment-charts] - https://gerrit.wikimedia.org/r/777412 (owner: MSantos) [17:45:43] (PS1) Herron: spicerack: add logging clusters to elasticsearch config [puppet] - https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864) [17:46:25] (PS5) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) [17:46:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24125 and previous config saved to /var/cache/conftool/dbconfig/20220405-174648-ladsgroup.json [17:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:09] (CR) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez) [17:48:06] (CR) Herron: "something to get the ball rolling, AFAICT we'll need to decide if we want to open up ferm access direct from the cumin hosts, or use a pro" [puppet] - https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864) (owner: Herron) [17:48:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6002.drmrs.wmnet [17:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4035.ulsfo.wmnet [17:48:25] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34713/" [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez) [17:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:43] (PS2) Herron: spicerack: add logging clusters to elasticsearch config [puppet] - https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864) [17:49:29] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6003.drmrs.wmnet [17:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:49] (Merged) jenkins-bot: mobileapps: bump to 2022-04-04-120513-production [deployment-charts] - https://gerrit.wikimedia.org/r/777412 (owner: MSantos) [17:51:18] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2020.wmnet [17:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:40] (CR) Arturo
Borrero Gonzalez: [V: +1 C: +2] cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez) [17:51:58] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse201[7-9].wmnet [17:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:25] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse201[7-9].codfw.wmnet [17:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:14] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet [17:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:42] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host parse2020.codfw.wmnet [17:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:56:14] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [17:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:43] SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) [17:56:59] (PS5) Herron: ipmiseld: ensure service enabled and running [puppet] - https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) [17:57:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47967 bytes in 3.681 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:57:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2001-dev.codfw.wmnet
[17:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6003.drmrs.wmnet [17:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:10] !log rebooting hosts in the parse201* range, starting with parse2019, counting down [17:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:26] (03CR) 10Herron: ipmiseld: ensure service enabled and running (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [17:58:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10cmooney) ^^ Above reimage seemed to fail due to some disk problem, I suspect maybe the raid config needs to be done in the BIOS (I was running... [17:58:57] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2020.codfw.wmnet [17:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:38] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet [17:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:48] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parse2020.codfw.wmnet [17:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:26] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [18:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:33] PROBLEM - Host parse2019 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:29] RECOVERY - Host parse2019 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [18:01:39] PROBLEM 
- Host parse2018 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24126 and previous config saved to /var/cache/conftool/dbconfig/20220405-180153-ladsgroup.json [18:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:23] RECOVERY - Host parse2018 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [18:05:01] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2001-dev.codfw.wmnet [18:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:12] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [18:07:09] PROBLEM - Host parse2017 is DOWN: PING CRITICAL - Packet loss = 100% [18:08:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6004.drmrs.wmnet [18:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:37] RECOVERY - Host parse2017 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [18:12:24] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:15:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6004.drmrs.wmnet [18:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24127 and previous config saved to /var/cache/conftool/dbconfig/20220405-181658-ladsgroup.json [18:17:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:17:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [18:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [18:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24128 and previous config saved to /var/cache/conftool/dbconfig/20220405-181712-ladsgroup.json [18:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:52] (03PS1) 10RLazarus: Make the package name consistent [software/httpbb] - 10https://gerrit.wikimedia.org/r/777422 [18:18:28] (03PS2) 10RLazarus: Make the package name consistent [software/httpbb] - 10https://gerrit.wikimedia.org/r/777422 (https://phabricator.wikimedia.org/T299705) [18:19:50] (03CR) 10RLazarus: [C: 03+2] Make the package name consistent [software/httpbb] - 10https://gerrit.wikimedia.org/r/777422 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [18:20:53] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) @Marostegui or anyone in the DB team I am planning on moving all the db nodes in Rack B1 to Rack B5 but please see detail in the table in the description. I have db2109 already in... 
[18:21:03] (03Merged) 10jenkins-bot: Make the package name consistent [software/httpbb] - 10https://gerrit.wikimedia.org/r/777422 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [18:22:09] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2016.codfw.wmnet [18:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:36] (03PS1) 10JHathaway: mx: reject email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) [18:22:58] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2020.codfw.wmnet [18:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:28] PROBLEM - Host parse2016 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:34] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2015.codfw.wmnet [18:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:15] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2019.codfw.wmnet [18:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6005.drmrs.wmnet [18:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:20] RECOVERY - Host parse2016 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [18:24:36] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2018.codfw.wmnet [18:24:37] (03PS2) 10JHathaway: mx: reject email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) [18:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:18] PROBLEM - Host parse2015 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:28] !log dzahn@cumin2002 
conftool action : set/pooled=yes; selector: dc=codfw,name=parse2017.codfw.wmnet [18:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:50] RECOVERY - Host parse2015 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [18:28:34] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2016.codfw.wmnet [18:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:39] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2015.codfw.wmnet [18:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:45] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/httpbb/buster/httpbb_0.0.1-1_amd64.changes # T299705 [18:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:49] T299705: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 [18:29:06] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [18:30:03] 10SRE, 10Infrastructure-Foundations, 10observability, 10Patch-For-Review, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10herron) >>! In T305147#7824394, @MoritzMuehlenhoff wrote: > There's some things which are still puzzling here: Why wasn't this n... 
[18:31:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6005.drmrs.wmnet [18:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:28] !log rzl@apt1001:~$ sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/bullseye/httpbb_0.0.1-1+deb11u1_amd64.changes [18:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:21] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet [18:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [18:41:33] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet [18:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:35] 10SRE, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10Umherirrender) [18:41:51] 10SRE: prometheus: ganglia-gen and outdated Ganglia:cluster resource name - https://phabricator.wikimedia.org/T186918 (10Umherirrender) [18:42:11] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog1002.eqiad.wmnet [18:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:22] 10SRE, 10Phabricator: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644 (10Umherirrender) [18:42:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6006.drmrs.wmnet [18:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:17] 10SRE: 
operations/software repo: flake8 check - https://phabricator.wikimedia.org/T178877 (10Umherirrender) 05Open→03Resolved [18:43:19] 10SRE, 10User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687 (10Umherirrender) [18:43:35] 10SRE: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543 (10Umherirrender) [18:43:57] (03CR) 10Dzahn: [C: 03+1] gitlab: fix duplicate backup_dir hiera key [puppet] - 10https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [18:43:57] 10SRE, 10Wikimedia-Apache-configuration: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176 (10Umherirrender) [18:44:22] 10SRE, 10IRCecho, 10Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875 (10Umherirrender) [18:44:58] 10SRE, 10Deployments: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585 (10Umherirrender) [18:45:19] 10SRE: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (10Umherirrender) [18:45:52] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: excimer-k8s-log.service,excimer-k8s-wall-log.service,excimer-log.service,excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:54] 10SRE: Rename 'restricted' group? 
- https://phabricator.wikimedia.org/T104671 (10Umherirrender) [18:46:42] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: fix duplicate backup_dir hiera key [puppet] - 10https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [18:46:54] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: excimer-k8s-log.service,excimer-k8s-wall-log.service,excimer-log.service,excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:32] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet [18:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:48] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:12] 10SRE, 10cloud-services-team (Kanban): Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251 (10Umherirrender) [18:48:26] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:49:39] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962) (owner: 10Eigyan) [18:50:24] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6006.drmrs.wmnet [18:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:56] 👋 [18:52:06] 👋 [18:52:11] hey [18:52:15] this is a new alert, right? [18:52:22] here too [18:52:24] I thought it was my IRC shell box [18:52:26] apparently not [18:52:32] * volans here [18:52:32] cdanis: correct [18:52:33] cdanis: it is yeah [18:52:37] is it expected that the only alert I see on https://alerts.wikimedia.org/?q=alertname%3DProbeDown is for a different service entirely? [18:52:40] yes is the new one, might be a false alarm [18:53:15] oh [18:53:16] cdanis: I see it [18:53:19] apparently I was holding it wrong? [18:53:28] it only showed service inference:30443 until I pressed a red button [18:53:41] it didn't show up the first time I clicked and then it did show up the second time? might be inconsistent [18:53:56] target=https://[10.2.2.5]:443/w/health-check.php msg="Error for HTTP request" err="Get \"https://10.2.2.5:443/w/health-check.php\": context deadline exceeded" [18:53:58] yeah it should show up [18:54:18] godog is the timeout set at 2.5s? 
[18:54:19] it does for me now fwiw, I see from the dashboard that videoscaler is flapping [18:54:23] msg="Probe failed" duration_seconds=2.500589415 [18:54:34] volans: IIRC yes [18:55:47] so yeah looks like videoscaler is slow according to the probe anyways, not sure we're in trouble tho [18:56:00] some seem to take as long 7secs in my brief testing with curl [18:56:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:56:28] it switched to zotero? [18:56:43] [question] those new alerts, did were live without the page bit before? or we were just collecting the data? [18:57:01] volans: they were live, though as warnings [18:57:15] 10SRE: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940 (10Umherirrender) [18:57:16] do we have any stats on how much they were/will be firing? [18:57:50] because they seem (at least so far) much more sensitive than the existing ones AFAICT [18:57:51] yes we have the logs, though looking back in the dashboards also will say [18:58:25] agreed, I'll revert the paging change for now and ask questions later [18:59:24] "revert" actually I'll switch to critical from paging [18:59:30] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Ladsgroup) FWIW: |db2076|sanitarium master for s6 |db2086|core multiinsace: s7, s8 |db2107|candidate master for s2 |db2137|core multiinsace: s4, s5 |db2143|x2 replica |db2147|s4 replica A... 
[18:59:31] SGTM [18:59:46] (03PS1) 10Filippo Giunchedi: sre: move network probes to critical from page [alerts] - 10https://gerrit.wikimedia.org/r/777429 [18:59:48] +1 thanks godog [18:59:57] just ignore the page bit for now should be enough to get a generic sense of how they go for now [19:00:06] *ignoring [19:00:26] agreed, sorry folks didn't begin as smooth as I planned! [19:00:36] (03CR) 10Volans: [C: 03+1] "LGTM (logically, I'm not that familiar with the current abstraction)" [alerts] - 10https://gerrit.wikimedia.org/r/777429 (owner: 10Filippo Giunchedi) [19:01:35] (03PS6) 10Bking: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) [19:02:23] godog: no problem, there are always bumps on the road to making things better! [19:02:49] heheh indeed jhathaway, thank you [19:03:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10Jclark-ctr) [19:03:46] (03CR) 10Herron: [C: 03+1] sre: move network probes to critical from page [alerts] - 10https://gerrit.wikimedia.org/r/777429 (owner: 10Filippo Giunchedi) [19:03:53] (03PS2) 10Filippo Giunchedi: sre: move network probes to critical from page [alerts] - 10https://gerrit.wikimedia.org/r/777429 [19:04:04] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] sre: move network probes to critical from page [alerts] - 10https://gerrit.wikimedia.org/r/777429 (owner: 10Filippo Giunchedi) [19:04:59] basically this https://i.redd.it/uziifr83woo81.jpg [19:05:27] (03PS7) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [19:05:32] PROBLEM - PHP7 jobrunner on mw1310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner 
[19:06:08] (03PS1) 10RLazarus: httpbb: Add force => true to properly delete the old $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/777430 (https://phabricator.wikimedia.org/T299705) [19:06:52] waiting a few seconds for the change to be rolled out to the prometheus hosts then going afk [19:07:36] RECOVERY - PHP7 jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.360 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [19:08:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6009.drmrs.wmnet [19:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:05] (03CR) 10RLazarus: [C: 03+2] httpbb: Add force => true to properly delete the old $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/777430 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [19:09:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10wiki_willy) [19:09:54] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:23] (03CR) 10jerkins-bot: [V: 04-1] elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [19:16:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6009.drmrs.wmnet [19:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:17:24] 10SRE: 
should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100 (10Umherirrender) [19:22:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:51] godog: can you remove the # page? [19:24:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:37] XioNoX: I think I see where to remove it, giving it a try [19:24:55] rzl: thanks happy to review the change [19:26:25] (03PS1) 10RLazarus: sre: Remove paging hotword from network probes [alerts] - 10https://gerrit.wikimedia.org/r/777432 [19:28:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24129 and previous config saved to /var/cache/conftool/dbconfig/20220405-192800-ladsgroup.json [19:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:29:03] XioNoX: ready, https://gerrit.wikimedia.org/r/777432 [19:29:12] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6010.drmrs.wmnet [19:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:29:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:47] rzl: lgtm! [19:29:49] (03CR) 10Ayounsi: [C: 03+1] sre: Remove paging hotword from network probes [alerts] - 10https://gerrit.wikimedia.org/r/777432 (owner: 10RLazarus) [19:30:30] (03CR) 10RLazarus: [C: 03+2] sre: Remove paging hotword from network probes [alerts] - 10https://gerrit.wikimedia.org/r/777432 (owner: 10RLazarus) [19:32:39] (03Merged) 10jenkins-bot: sre: Remove paging hotword from network probes [alerts] - 10https://gerrit.wikimedia.org/r/777432 (owner: 10RLazarus) [19:33:14] manually running puppet on P:alerts::deploy::prometheus to avoid bothering folks over the next 30 min [19:35:03] thanls rzl [19:35:16] (03CR) 10Herron: "Looks good to me overall, will be great to shed these legacy addresses. Please see question inline, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [19:36:36] done 🤞 [19:36:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6010.drmrs.wmnet [19:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:57] (03PS1) 10Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) [19:38:01] (03PS1) 10Zabe: postgresql: remove absented backup crons [puppet] - 10https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673) [19:38:24] (03CR) 10jerkins-bot: [V: 04-1] postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [19:38:30] (03CR) 10JHathaway: mx: reject email to legacy mailing list domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [19:38:42] (03CR) 10jerkins-bot: [V: 04-1] postgresql: remove absented backup crons [puppet] - 10https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [19:41:01] (03PS2) 10Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) [19:42:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24130 and previous config saved to 
/var/cache/conftool/dbconfig/20220405-194305-ladsgroup.json [19:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:18] ^ dropping the hotword worked, at least [19:46:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6011.drmrs.wmnet [19:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:49:48] !log rzl@cumin2002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw(1307|1308|1309|1310|1311|1318|1334|1335|1336|1337).* [19:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6011.drmrs.wmnet [19:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:17] (03PS3) 10Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) [19:56:56] (03PS4) 10Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) [19:57:32] I'm trying to figure out how to make the deployment server kick me out after 15 minutes of inactivity. But TMOUT is set readonly, to much higher value... is there a way to lower it? 
[19:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24131 and previous config saved to /var/cache/conftool/dbconfig/20220405-195810-ladsgroup.json [19:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:24] duesen: if you find a way to lower it, let me know so I can use the same trick to raise it for my user. ;) [20:00:05] RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T2000). [20:00:05] eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] greetings all [20:00:20] bd808: raising it is easy: exec env TMOUT=0 bash [20:00:56] Hi eigyan ! I'll be ready to do the deployment in about 15 minutes, apologies for the delay [20:01:02] (03CR) 10Zabe: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/34715/" [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:01:08] take your time RoanKattouw [20:01:12] :) [20:01:17] It's lunchtime and I need to eat :) [20:02:08] ...so, considering that it's actually easy to raise the TMOUT, can we just not make it readonly, so I can also lower it?... [20:02:49] oh wait, actually, does this work? 
exec env TMOUT=300 bash && exit [20:02:53] let me try that [20:04:34] I added two patches to the window, hope thats ok [20:05:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6012.drmrs.wmnet [20:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:16] zabe: or there may be mices around :P [20:05:16] 10SRE, 10Data-Engineering, 10Product-Analytics, 10wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10JArguello-WMF) [20:05:39] heh, that does work. nice. [20:05:52] :p [20:06:14] I'll just put that into my .profile, then :P [20:13:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24132 and previous config saved to /var/cache/conftool/dbconfig/20220405-201315-ladsgroup.json [20:13:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:13:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6012.drmrs.wmnet [20:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:37] * urbanecm 
waves [20:18:44] eigyan: RoanKattouw: did deployment already start? [20:18:50] (if not, i can do it) [20:18:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) @cmooney Can you confirm the raid setup please. analytics-flex is first 2 ssds are raid 1 and the rest jbod? [20:23:37] @urba [20:23:41] (03CR) 10Dzahn: postgresql: migrate backup crons to systemd timer jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:23:54] urbanecm I don't think so [20:24:04] in that case, let's do it [20:24:06] hello eigyan [20:24:10] sorry for the lateness [20:24:32] and hello zabe! just saw you have a patch too [20:24:33] Greetings urbanecm RoanKattouw was just having some lunch [20:24:43] yep, saw that in the scrollback [20:24:49] I'll deploy today [20:24:59] hey [20:25:01] urbanecm perfect...let's rock! 
[20:25:11] (03PS1) 10RLazarus: httpbb: Clean up absented objects [puppet] - 10https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) [20:25:23] (03CR) 10Urbanecm: [C: 03+2] [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962) (owner: 10Eigyan) [20:26:06] (03Merged) 10jenkins-bot: [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962) (owner: 10Eigyan) [20:26:18] (03PS2) 10RLazarus: httpbb: Clean up absented objects [puppet] - 10https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) [20:27:11] (03CR) 10Dzahn: [C: 03+1] httpbb: Clean up absented objects [puppet] - 10https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [20:27:27] eigyan: i pulled it to mwdebug1001. can you test please? [20:27:46] (03CR) 10Dzahn: [C: 03+1] "assuming puppet already ran on everything that could have this" [puppet] - 10https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [20:27:46] urbanecm testing now thanks! [20:27:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6013.drmrs.wmnet [20:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:56] Thanks urbanecm ! I just got back from lunch but I see you've got it [20:28:20] no problem RoanKattouw. 
i hope the lunch was good :) [20:30:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:42] (03CR) 10Dzahn: [C: 03+1] "really checked with 'sudo cumin 'C:profile::httpbb' 'file /srv/deployment/httpbb'' 4 hosts and none have the dir :)" [puppet] - 10https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [20:31:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:33] urbanecm patch is working as expected. Thanks! [20:32:38] syncing [20:33:35] (03PS3) 10Urbanecm: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:33:40] (03CR) 10Urbanecm: [C: 03+2] tests: rename $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:33:58] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 10c16c5ed46014ec6f5e771f84320441974bef6c: [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD (T303962) (duration: 00m 55s) [20:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:01] T303962: Undeploy Safety Survey for EN, ES, FR wikis FROM PRODUCTION - https://phabricator.wikimedia.org/T303962 [20:34:04] eigyan: and live [20:34:06] anything else? 
[20:34:21] (03Merged) 10jenkins-bot: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:34:25] urbanecm I am all good here, thank you so much! [20:34:32] happy to help! [20:34:41] (03CR) 10RLazarus: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [20:34:44] zabe: first patch only changes tests, so no testing etc. needed [20:34:47] looking at the second one [20:35:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6013.drmrs.wmnet [20:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:37] 10SRE, 10serviceops, 10Patch-For-Review: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 (10RLazarus) 05Open→03Resolved [20:35:42] 10SRE, 10Wikimedia-Apache-configuration, 10serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (10RLazarus) [20:37:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:06] (03PS1) 10Jdlrobson: Update to 78eef14, rename viewportSize to viewportSizeBucket [extensions/WikimediaEvents] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777389 (https://phabricator.wikimedia.org/T301391) [20:38:08] (03PS2) 10Urbanecm: Change upload dialog automatic upload comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776323 (https://phabricator.wikimedia.org/T305303) (owner: 10Zabe) [20:38:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:38:13] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:16] (03CR) 10Urbanecm: [C: 03+2] Change upload dialog automatic upload comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776323 (https://phabricator.wikimedia.org/T305303) (owner: 10Zabe) [20:38:41] (03CR) 10Jdlrobson: "Jan: Any chance you would be able to backport this tomorrow? We were hoping to go into next week with some data to help make a decision ar" [extensions/WikimediaEvents] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777389 (https://phabricator.wikimedia.org/T301391) (owner: 10Jdlrobson) [20:39:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:31] (03Merged) 10jenkins-bot: Change upload dialog automatic upload comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776323 (https://phabricator.wikimedia.org/T305303) (owner: 10Zabe) [20:40:31] zabe: your patch is at mwdebug1001 [20:40:33] please test [20:41:00] doing [20:41:12] !log deploying refinery for https://gerrit.wikimedia.org/r/c/analytics/refinery/+/776269/ [20:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:12] urbanecm, lgtm: https://commons.wikimedia.org/w/index.php?title=File:Hi-776323.png&action=history [20:45:24] looks good too [20:45:26] syncing [20:47:19] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 8ea86349017e71dcd38bde0663cfb13e86fe127c: Change upload dialog automatic upload comments (T305303) (duration: 00m 54s) [20:47:21] zabe: it's live [20:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:23] anything else? 
[20:47:25] T305303: Change upload dialog edit summary on Commons - https://phabricator.wikimedia.org/T305303 [20:47:29] !log puppetmaster1001 - running test downloads of geoip databases to a temp dir [20:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:47] no, thx :) [20:48:05] okay :) [20:48:08] then we're done [20:48:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:13] !log UTC late B&C window done [20:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:17] !log razzi@deploy1002 Started deploy [analytics/refinery@fd8b410]: Regular analytics weekly train [analytics/refinery@fd8b410] [20:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6014.drmrs.wmnet [20:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24133 and previous config saved to /var/cache/conftool/dbconfig/20220405-205822-ladsgroup.json [20:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:02:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6014.drmrs.wmnet [21:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:07] !log razzi@deploy1002 Finished deploy [analytics/refinery@fd8b410]: Regular analytics weekly train [analytics/refinery@fd8b410] (duration: 22m 50s) [21:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:12] !log razzi@deploy1002 Started deploy [analytics/refinery@fd8b410] (thin): Regular analytics weekly train THIN [analytics/refinery@fd8b410] [21:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:23] !log razzi@deploy1002 Finished deploy [analytics/refinery@fd8b410] (thin): Regular analytics weekly train THIN [analytics/refinery@fd8b410] (duration: 00m 10s) [21:14:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:25] !log razzi@deploy1002 Started deploy [analytics/refinery@fd8b410] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fd8b410] [21:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:01] (03CR) 10Dzahn: [C: 03+2] puppetmaster:geoip: stop trying to download GeoIP1 legacy databases [puppet] - 10https://gerrit.wikimedia.org/r/773843 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [21:20:23] (03CR) 10Dzahn: [C: 03+2] "tested with a manual update run, no files are being removed by this" [puppet] - 10https://gerrit.wikimedia.org/r/773843 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [21:21:13] !log razzi@deploy1002 Finished deploy [analytics/refinery@fd8b410] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fd8b410] (duration: 06m 48s) [21:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:03] (03PS1) 10Zabe: extdist: change response code from 302 to 301 [puppet] - 10https://gerrit.wikimedia.org/r/777446 [21:26:12] (03CR) 10Krinkle: [C: 03+1] "Good to go!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/776254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [21:26:56] (03PS2) 10Zabe: extdist: change response code from 302 to 301 [puppet] - 10https://gerrit.wikimedia.org/r/777446 [21:41:53] (03PS8) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [21:45:09] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (10Dzahn) [21:50:30] (03CR) 10jerkins-bot: [V: 04-1] elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [21:54:59] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (10Dzahn) > Modify the puppet code to no longer download the databases from MaxMind and then propagate to other servers/destinations. This is done. puppet c... 
[21:57:33] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10Dzahn) a:05Dzahn→03None [21:58:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24135 and previous config saved to /var/cache/conftool/dbconfig/20220405-215837-ladsgroup.json [21:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:58:47] (03PS5) 10Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) [22:01:46] 10SRE: prometheus: ganglia-gen and outdated Ganglia:cluster resource name - https://phabricator.wikimedia.org/T186918 (10Dzahn) The file mentioned was removed in T253555 / https://gerrit.wikimedia.org/r/c/operations/puppet/+/609131 [22:02:54] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10Dzahn) [22:03:00] 10SRE: prometheus: ganglia-gen and outdated Ganglia:cluster resource name - https://phabricator.wikimedia.org/T186918 (10Dzahn) [22:03:10] PROBLEM - SSH on wtp1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:03:39] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34717/" [puppet] - 10https://gerrit.wikimedia.org/r/777433 
(https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:03:40] zabe: thank you! please give it .sh file extension. That would add CI checks for shell scripts [22:04:03] ok [22:06:30] (03PS6) 10Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) [22:07:18] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) [22:08:31] mutante, done [22:08:33] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) 05Open→03Stalled ACK, I am setting this to stalled until May. [22:09:43] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34718/" [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:13:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24136 and previous config saved to /var/cache/conftool/dbconfig/20220405-221342-ladsgroup.json [22:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:00] (03PS2) 10Zabe: postgresql: remove absented backup crons [puppet] - 10https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673) [22:14:40] zabe: ack, thanks! 
[22:18:10] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) [22:28:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24137 and previous config saved to /var/cache/conftool/dbconfig/20220405-222847-ladsgroup.json [22:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:14] (03PS1) 10Zabe: zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) [22:29:16] (03PS1) 10Zabe: zookeeper: remove absented zookeeper-cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) [22:29:47] (03CR) 10jerkins-bot: [V: 04-1] zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:31:01] (03PS2) 10Zabe: zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) [22:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [22:43:10] (03PS1) 10Zabe: toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) [22:43:12] (03PS1) 10Zabe: toil: remove absented systemd_scope_cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673) [22:43:43] (03CR) 
10jerkins-bot: [V: 04-1] toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24138 and previous config saved to /var/cache/conftool/dbconfig/20220405-224352-ladsgroup.json [22:43:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:43:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:01] (03CR) 10jerkins-bot: [V: 04-1] toil: remove absented systemd_scope_cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:44:48] (03PS2) 10Zabe: toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) [22:45:39] (03PS2) 10Zabe: toil: remove absented systemd_scope_cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673) [22:52:05] (03PS1) 10Zabe: cinderutils: remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/777456 [22:52:54] PROBLEM - SSH on 
wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:15:42] (03CR) 10Zabe: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34719/" [puppet] - 10https://gerrit.wikimedia.org/r/761718 (owner: 10Zabe) [23:28:17] (03PS2) 10Reedy: Use namespaced GerritExtDistProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774963 [23:30:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [23:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [23:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24139 and previous config saved to /var/cache/conftool/dbconfig/20220405-233042-ladsgroup.json [23:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:31:14] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [23:33:20] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:54:00] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook