[00:01:15] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[00:01:47] <rzl>	 ^ 👀
[00:02:43] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:03:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24073 and previous config saved to /var/cache/conftool/dbconfig/20220405-000355-ladsgroup.json
[00:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:59] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1083.eqiad.wmnet
[00:07:43] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3062.esams.wmnet
[00:08:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5012.eqsin.wmnet
[00:10:07] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[00:11:17] <rzl>	 ^ I wasn't able to find out what that was, but it seems over now
[00:11:53] <rzl>	 POST latency only, and appservers only (not API) without a smoking gun from any particular backend! very weird
[00:12:05] <rzl>	 I'll poke around a little more but then leave it alone, unless it recurs
[00:16:16] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1083.eqiad.wmnet
[00:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:37] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:11] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3062.esams.wmnet
[00:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:18:00] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5012.eqsin.wmnet
[00:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24074 and previous config saved to /var/cache/conftool/dbconfig/20220405-001900-ladsgroup.json
[00:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:02] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp104[6-8].eqiad.wmnet
[00:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:25] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 3210 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[00:22:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:23:28] <mutante>	 !log wtp1046, wtp1047, wtp1048 - rebooting, one at a time
[00:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:10] <wikibugs>	 SRE, Generated Data Platform, Image-Suggestions, serviceops, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (Dzahn) >>! In T305155#7823133, @Dzahn wrote: > port reserved:  4017 >  > https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports...
[00:27:18] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1048.eqiad.wmnet
[00:27:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:21] <icinga-wm>	 ACKNOWLEDGEMENT - Host wtp1047 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot
[00:30:13] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 121816 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[00:30:41] <mutante>	 !log gitlab.wikimedia.org was down because gitlab1001 ran out of disk space. ran 'apt-get clean' to free 13G which made it recover...
[00:30:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:44] <mutante>	 hrmmm
[00:31:31] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:32:47] <mutante>	 !log gitlab.wikimedia.org was down because gitlab1001 ran out of disk space. ran 'apt-get clean' to free 13G which made it recover... T274463 - <+icinga-wm> RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK
[00:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:50] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[00:33:14] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1047.eqiad.wmnet
[00:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:20] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1046.eqiad.wmnet
[00:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:31] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4032.ulsfo.wmnet
[00:33:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24075 and previous config saved to /var/cache/conftool/dbconfig/20220405-003405-ladsgroup.json
[00:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:09] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[00:34:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[00:34:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[00:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24076 and previous config saved to /var/cache/conftool/dbconfig/20220405-003419-ladsgroup.json
[00:34:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:01] <mutante>	 !log gitlab2001 - apt-get clean to prevent disk space issues
[00:36:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:44] <mutante>	 !log gitlab1001 - mv 1648814678_2022_04_01_14.9.1_gitlab_backup.tar and other files from April 2nd/April 3rd over from /srv/gitlab-backup to /mnt/gitlab-backup to prevent another outage due to disk space T274463
[00:39:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:48] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[00:40:58] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4032.ulsfo.wmnet
[00:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:42] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Effeietsanders) I suspect this ticket can now be resolved. Haven't seen recent activity.
[00:42:07] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2042.codfw.wmnet
[00:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1084.eqiad.wmnet
[00:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5016.eqsin.wmnet
[00:43:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2042.codfw.wmnet
[00:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4034.ulsfo.wmnet
[00:51:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:49] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1084.eqiad.wmnet
[00:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:30] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3063.esams.wmnet
[00:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5016.eqsin.wmnet
[00:53:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4034.ulsfo.wmnet
[00:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0100)
[01:02:59] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3063.esams.wmnet
[01:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5002.eqsin.wmnet
[01:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3053.esams.wmnet
[01:07:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:48] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3053.esams.wmnet
[01:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:41] <wikibugs>	 (PS1) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040)
[01:23:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[01:27:01] <icinga-wm>	 PROBLEM - Host cp5002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:37] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:33:21] <sukhe>	 yeah, the cookbook for cp5002 is stuck at [77/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for cp5002.eqsin.wmnet
[01:33:43] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:36:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24077 and previous config saved to /var/cache/conftool/dbconfig/20220405-013609-ladsgroup.json
[01:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:13] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[01:41:15] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:30] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:56] <wikibugs>	 SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[01:47:25] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cp5002.eqsin.wmnet
[01:47:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24078 and previous config saved to /var/cache/conftool/dbconfig/20220405-015114-ladsgroup.json
[01:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:30] <wikibugs>	 SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi I updated the table with transit/transport links. Please double check. For cr1 to cr2 I have a total of 3 links: 2 on FPC3 and 1 on FPC4. My guess is the link on FPC4 is there in case FPC3...
[01:59:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423
[01:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:28] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423
[01:59:29] <stashbot>	 T305423: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423
[01:59:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:05:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24079 and previous config saved to /var/cache/conftool/dbconfig/20220405-020619-ladsgroup.json
[02:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:33] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.6 [core] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777041
[02:07:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.6 [core] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777041 (owner: TrainBranchBot)
[02:07:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:07:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24080 and previous config saved to /var/cache/conftool/dbconfig/20220405-022124-ladsgroup.json
[02:21:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[02:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[02:21:28] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[02:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24081 and previous config saved to /var/cache/conftool/dbconfig/20220405-022132-ladsgroup.json
[02:21:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.6 [core] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777041 (owner: TrainBranchBot)
[02:29:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:29:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[03:17:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24082 and previous config saved to /var/cache/conftool/dbconfig/20220405-031745-ladsgroup.json
[03:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:17:50] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[03:32:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24083 and previous config saved to /var/cache/conftool/dbconfig/20220405-033251-ladsgroup.json
[03:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:47:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24084 and previous config saved to /var/cache/conftool/dbconfig/20220405-034756-ladsgroup.json
[03:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:57] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:02:43] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:03:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24085 and previous config saved to /var/cache/conftool/dbconfig/20220405-040301-ladsgroup.json
[04:03:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[04:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[04:03:05] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[04:03:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24086 and previous config saved to /var/cache/conftool/dbconfig/20220405-040309-ladsgroup.json
[04:03:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:12:41] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:34:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 for testing T301879', diff saved to https://phabricator.wikimedia.org/P24087 and previous config saved to /var/cache/conftool/dbconfig/20220405-043426-marostegui.json
[04:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:31] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[05:00:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24088 and previous config saved to /var/cache/conftool/dbconfig/20220405-050047-ladsgroup.json
[05:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:51] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[05:12:07] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:15:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24089 and previous config saved to /var/cache/conftool/dbconfig/20220405-051552-ladsgroup.json
[05:15:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:32] <_joe_>	 !log uploading new minor version of conftool to apt for buster/bullseye (requestctl new feature)
[05:17:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24090 and previous config saved to /var/cache/conftool/dbconfig/20220405-053057-ladsgroup.json
[05:30:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:03] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:42:30] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:46:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24091 and previous config saved to /var/cache/conftool/dbconfig/20220405-054602-ladsgroup.json
[05:46:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[05:46:07] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[05:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24092 and previous config saved to /var/cache/conftool/dbconfig/20220405-054610-ladsgroup.json
[05:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 for testing T301879', diff saved to https://phabricator.wikimedia.org/P24093 and previous config saved to /var/cache/conftool/dbconfig/20220405-055256-marostegui.json
[05:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:00] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0600).
[06:01:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 into API for testing T301879', diff saved to https://phabricator.wikimedia.org/P24094 and previous config saved to /var/cache/conftool/dbconfig/20220405-060124-marostegui.json
[06:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:30] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[06:06:01] <icinga-wm>	 PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:08:07] <icinga-wm>	 RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:12:25] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:21:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: Herron)
[06:36:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1132 T301879', diff saved to https://phabricator.wikimedia.org/P24095 and previous config saved to /var/cache/conftool/dbconfig/20220405-063648-marostegui.json
[06:36:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:52] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[06:37:13] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[06:50:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24096 and previous config saved to /var/cache/conftool/dbconfig/20220405-065053-ladsgroup.json
[06:50:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:58] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[06:58:20] <wikibugs>	 SRE, Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (NFSL2001) @MoritzMuehlenhoff Any updates for this? The preview character is still missing and problematic to Chinese users.  Here is a screenshot of how it should look for last 2 elements: {F...
[07:00:05] <jouncebot>	 Amir1, awight, Urbanecm, and taavi: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:22] <taavi>	 o/ looks like nothing to do
[07:04:46] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:05:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24097 and previous config saved to /var/cache/conftool/dbconfig/20220405-070558-ladsgroup.json
[07:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:20] <wikibugs>	 (PS1) Ayounsi: sflow: fix pre_tag2_filter [puppet] - https://gerrit.wikimedia.org/r/777292 (https://phabricator.wikimedia.org/T263277)
[07:21:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24098 and previous config saved to /var/cache/conftool/dbconfig/20220405-072103-ladsgroup.json
[07:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:52] <wikibugs>	 (CR) Ayounsi: [C: +1] sre.SREBatchBase: additional customizations [cookbooks] - https://gerrit.wikimedia.org/r/776965 (owner: Volans)
[07:23:27] <wikibugs>	 (CR) Ayounsi: [C: +1] sre.cdn.roll-restart-varnish: improvements [cookbooks] - https://gerrit.wikimedia.org/r/776966 (owner: Volans)
[07:28:00] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (ayounsi) Open→Resolved a:Andrew→ayounsi Fix merged.
[07:36:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24099 and previous config saved to /var/cache/conftool/dbconfig/20220405-073608-ladsgroup.json
[07:36:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[07:36:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[07:36:12] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[07:36:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24100 and previous config saved to /var/cache/conftool/dbconfig/20220405-073617-ladsgroup.json
[07:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:44] <wikibugs>	 (PS1) Jaime Nuche: testwikis wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777294
[07:37:46] <wikibugs>	 (CR) Jaime Nuche: [C: +2] testwikis wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777294 (owner: Jaime Nuche)
[07:38:25] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777294 (owner: Jaime Nuche)
[07:38:27] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.6  refs T305212
[07:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:30] <stashbot>	 T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212
[07:39:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:40:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:45:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:46:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:47:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ayounsi) See comment in T303424#7830897, they *might* be able to go in any 10G rack, private vlan.  Regardless, those are prod hosts (public/pr...
[07:51:13] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:52:58] <XioNoX>	 !log disable BGP to Tata in drmrs for circuit move - T298208
[07:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:19] <wikibugs>	 (CR) JMeybohm: [C: +2] Update cert-manager to 1.5.4-3 [deployment-charts] - https://gerrit.wikimedia.org/r/776971 (https://phabricator.wikimedia.org/T304092) (owner: JMeybohm)
[07:57:36] <wikibugs>	 (Merged) jenkins-bot: Update cert-manager to 1.5.4-3 [deployment-charts] - https://gerrit.wikimedia.org/r/776971 (https://phabricator.wikimedia.org/T304092) (owner: JMeybohm)
[08:00:04] <jouncebot>	 jnuche and hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T0800).
[08:02:43] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:12:00] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:12:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:17] <wikibugs>	 (CR) JMeybohm: [C: +1] Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[08:14:25] <wikibugs>	 (CR) JMeybohm: [C: +1] Change the Calico's pod IP subnet for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/776876 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[08:14:34] <wikibugs>	 (PS1) Daniel Kinzler: Add ~daniel/.profile [puppet] - https://gerrit.wikimedia.org/r/777298
[08:14:53] <wikibugs>	 (CR) Jcrespo: [C: +1] "fileset change seems good" [puppet] - https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[08:15:12] <wikibugs>	 (CR) JMeybohm: [C: +1] role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[08:15:29] <wikibugs>	 (CR) JMeybohm: [C: +1] role::ml_k8s::master: change the codfw svc IP range [puppet] - https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[08:19:21] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet
[08:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:21] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.6  refs T305212 (duration: 42m 53s)
[08:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:24] <stashbot>	 T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212
[08:23:10] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye
[08:23:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:13] <wikibugs>	 (PS1) Jaime Nuche: group0 wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777299
[08:26:16] <wikibugs>	 (CR) Jaime Nuche: [C: +2] group0 wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777299 (owner: Jaime Nuche)
[08:26:50] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dragonfly-supernode2001.codfw.wmnet
[08:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:56] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777299 (owner: Jaime Nuche)
[08:27:03] <icinga-wm>	 PROBLEM - Check systemd state on dragonfly-supernode2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:28:08] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw1001.eqiad.wmnet with OS bullseye
[08:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:50] <wikibugs>	 (CR) Btullis: [C: +2] Define the DATHUB_SECRET value [deployment-charts] - https://gerrit.wikimedia.org/r/776954 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[08:29:12] <wikibugs>	 (CR) Btullis: [C: +2] Remove test hosts from the JVM heap memory alerts [alerts] - https://gerrit.wikimedia.org/r/776919 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[08:29:19] <icinga-wm>	 RECOVERY - Check systemd state on dragonfly-supernode2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:32] <wikibugs>	 (CR) Btullis: [C: +2] Remove the statsv source from the VarnishkafkaNoMessages alert [alerts] - https://gerrit.wikimedia.org/r/776912 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[08:31:23] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.6  refs T305212
[08:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:26] <stashbot>	 T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212
[08:31:36] <wikibugs>	 (Merged) jenkins-bot: Remove test hosts from the JVM heap memory alerts [alerts] - https://gerrit.wikimedia.org/r/776919 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[08:33:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:14] <wikibugs>	 (PS1) Volans: homer: adjust the daily diff start time [puppet] - https://gerrit.wikimedia.org/r/777300
[08:34:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24101 and previous config saved to /var/cache/conftool/dbconfig/20220405-083423-ladsgroup.json
[08:34:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:27] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[08:35:02] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye
[08:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:01] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[08:41:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:55] <wikibugs>	 (PS3) Jcrespo: dbbackups: Setup a valid_sections.txt config for db backup checks [puppet] - https://gerrit.wikimedia.org/r/776969 (https://phabricator.wikimedia.org/T301315)
[08:45:50] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Setup a valid_sections.txt config for db backup checks [puppet] - https://gerrit.wikimedia.org/r/776969 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[08:46:30] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage
[08:46:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:19] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage
[08:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24102 and previous config saved to /var/cache/conftool/dbconfig/20220405-084928-ladsgroup.json
[08:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:16] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[08:52:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:21] <wikibugs>	 (PS1) JMeybohm: Move kubestagemaster* to bullseye and upstream docker [puppet] - https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435)
[08:58:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Majavah) Cloudwebs need access to various OS APIs. Most of them are hosted in the production realm and should be accessible from any production...
[09:04:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24103 and previous config saved to /var/cache/conftool/dbconfig/20220405-090434-ladsgroup.json
[09:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:57] <wikibugs>	 (PS1) JMeybohm: Move kubemaster2002 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435)
[09:05:59] <wikibugs>	 (PS1) JMeybohm: Move kubemaster2001 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435)
[09:07:18] <wikibugs>	 (PS3) Jcrespo: dbbackups: Monitor db_inventory rather than zarcillo section [puppet] - https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315)
[09:07:20] <wikibugs>	 (PS1) Jcrespo: dbbackups: Update valid_sections.txt permissions to be world-readable [puppet] - https://gerrit.wikimedia.org/r/777312 (https://phabricator.wikimedia.org/T301315)
[09:07:34] <wikibugs>	 (PS2) Jcrespo: dbbackups: Update valid_sections.txt permissions to be world-readable [puppet] - https://gerrit.wikimedia.org/r/777312 (https://phabricator.wikimedia.org/T301315)
[09:07:51] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34691/console" [puppet] - https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:08:46] <wikibugs>	 (CR) Volans: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/777298 (owner: Daniel Kinzler)
[09:08:52] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34692/console" [puppet] - https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:08:54] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34693/console" [puppet] - https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:09:33] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 18.33 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:10:02] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Update valid_sections.txt permissions to be world-readable [puppet] - https://gerrit.wikimedia.org/r/777312 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[09:11:10] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw1001: hieradata: refresh NIC names [puppet] - https://gerrit.wikimedia.org/r/777313 (https://phabricator.wikimedia.org/T304598)
[09:11:15] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.6"
[09:11:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:45] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:12:00] <wikibugs>	 (PS1) Btullis: Allow wikikube staging pod range to access kafka eqiad-test cluster [puppet] - https://gerrit.wikimedia.org/r/777314 (https://phabricator.wikimedia.org/T303049)
[09:12:11] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw1001: hieradata: refresh NIC names [puppet] - https://gerrit.wikimedia.org/r/777313 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[09:12:50] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:12:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:58] <wikibugs>	 (PS1) Jaime Nuche: Revert "group0 wikis to 1.39.0-wmf.6  refs T305212" [mediawiki-config] - https://gerrit.wikimedia.org/r/777315
[09:13:00] <wikibugs>	 (CR) Jaime Nuche: [C: +2] Revert "group0 wikis to 1.39.0-wmf.6  refs T305212" [mediawiki-config] - https://gerrit.wikimedia.org/r/777315 (owner: Jaime Nuche)
[09:13:15] <wikibugs>	 (PS1) Majavah: dynamicproxy: remove support for x-novaproxy-edit-dns [puppet] - https://gerrit.wikimedia.org/r/777316 (https://phabricator.wikimedia.org/T295246)
[09:13:22] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34694/console" [puppet] - https://gerrit.wikimedia.org/r/777314 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[09:13:44] <wikibugs>	 (Merged) jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.6  refs T305212" [mediawiki-config] - https://gerrit.wikimedia.org/r/777315 (owner: Jaime Nuche)
[09:14:41] <wikibugs>	 (PS1) MMandere: site: Reimage cp5015 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777317 (https://phabricator.wikimedia.org/T290005)
[09:14:43] <wikibugs>	 (PS1) MMandere: site: Reimage cp6007 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005)
[09:14:45] <wikibugs>	 (PS1) MMandere: site: Reimage cp5013 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005)
[09:14:47] <wikibugs>	 (PS1) MMandere: site: Reimage cp5007 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005)
[09:14:49] <wikibugs>	 (PS1) MMandere: site: Reimage cp5001 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777321 (https://phabricator.wikimedia.org/T290005)
[09:14:51] <wikibugs>	 (PS1) MMandere: site: Reimage cp4035 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777322 (https://phabricator.wikimedia.org/T290005)
[09:14:53] <wikibugs>	 (PS1) MMandere: site: Reimage cp3052 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777323 (https://phabricator.wikimedia.org/T290005)
[09:14:55] <wikibugs>	 (PS1) MMandere: site: Reimage cp4027 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777324 (https://phabricator.wikimedia.org/T290005)
[09:14:57] <wikibugs>	 (PS1) MMandere: site: Reimage cp4033 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777325 (https://phabricator.wikimedia.org/T290005)
[09:14:59] <wikibugs>	 (PS1) MMandere: site: Reimage cp4021 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777326 (https://phabricator.wikimedia.org/T290005)
[09:15:01] <wikibugs>	 (PS1) JMeybohm: Move kubemaster1002 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435)
[09:15:03] <wikibugs>	 (PS1) JMeybohm: Move kubemaster1001 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435)
[09:17:05] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34695/console" [puppet] - https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:17:26] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34696/console" [puppet] - https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:19:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24104 and previous config saved to /var/cache/conftool/dbconfig/20220405-091939-ladsgroup.json
[09:19:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[09:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[09:19:43] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[09:19:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24105 and previous config saved to /var/cache/conftool/dbconfig/20220405-091947-ladsgroup.json
[09:19:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:21] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Monitor db_inventory rather than zarcillo section [puppet] - https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[09:20:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:58] <wikibugs>	 (PS1) Btullis: Allow kikikube staging pods to access the analytics-meta test instance [puppet] - https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459)
[09:21:19] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: Btullis)
[09:21:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:22:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:57] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[09:22:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:45] <wikibugs>	 (PS2) Btullis: Allow kikikube staging pods to access the analytics-meta test instance [puppet] - https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459)
[09:29:17] <wikibugs>	 (CR) Elukey: [C: +1] Move kubestagemaster* to bullseye and upstream docker [puppet] - https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:30:37] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: Btullis)
[09:33:01] <wikibugs>	 (PS2) Btullis: Remove the statsv source from the VarnishkafkaNoMessages alert [alerts] - https://gerrit.wikimedia.org/r/776912 (https://phabricator.wikimedia.org/T300246)
[09:33:12] <wikibugs>	 (CR) Elukey: [C: +1] Move kubemaster2002 to bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:33:20] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Remove the statsv source from the VarnishkafkaNoMessages alert [alerts] - https://gerrit.wikimedia.org/r/776912 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[09:33:28] <wikibugs>	 (CR) Elukey: [C: +1] Move kubemaster2001 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777311 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:33:51] <wikibugs>	 (CR) Elukey: [C: +1] Move kubemaster1002 to bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777327 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:34:05] <wikibugs>	 (CR) Kormat: dbtools: Port switchover-tmpl to python (1 comment) [software] - https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: Ladsgroup)
[09:34:21] <wikibugs>	 (CR) Elukey: [C: +1] Move kubemaster1001 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777328 (https://phabricator.wikimedia.org/T305435) (owner: JMeybohm)
[09:38:51] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34697/console" [puppet] - https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: Btullis)
[09:39:20] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Allow kikikube staging pods to access the analytics-meta test instance [puppet] - https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: Btullis)
[09:40:24] <wikibugs>	 (CR) Btullis: Allow kikikube staging pods to access the analytics-meta test instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777329 (https://phabricator.wikimedia.org/T301459) (owner: Btullis)
[09:42:30] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:43:06] <wikibugs>	 (PS1) Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860)
[09:43:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP test replacing smokeping with blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:46:30] <godog>	 shush jerkins
[09:49:12] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:05] <wikibugs>	 (PS2) Elukey: role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673)
[09:52:07] <wikibugs>	 (PS2) Elukey: role::ml_k8s::master: change the codfw svc IP range [puppet] - https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673)
[09:52:09] <wikibugs>	 (PS1) Elukey: install_server: set Bullseye for ml-serve-ctrl* nodes [puppet] - https://gerrit.wikimedia.org/r/777332 (https://phabricator.wikimedia.org/T304673)
[09:54:46] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - https://gerrit.wikimedia.org/r/777333
[09:55:04] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598)
[09:55:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Move Prometheus Apache setup to separate profile [puppet] - https://gerrit.wikimedia.org/r/775296 (owner: Muehlenhoff)
[09:58:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34698/" [puppet] - https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[10:00:42] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (Vgutierrez)
[10:02:24] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp5015 as cache::text_haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777317 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:02:43] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp6007 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:02:57] <wikibugs>	 (CR) Klausman: [C: +1] install_server: set Bullseye for ml-serve-ctrl* nodes [puppet] - https://gerrit.wikimedia.org/r/777332 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[10:03:22] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp5013 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:03:30] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598)
[10:03:47] <wikibugs>	 (CR) Ayounsi: [C: +1] homer: adjust the daily diff start time [puppet] - https://gerrit.wikimedia.org/r/777300 (owner: Volans)
[10:03:50] <wikibugs>	 (CR) Elukey: [C: +2] install_server: set Bullseye for ml-serve-ctrl* nodes [puppet] - https://gerrit.wikimedia.org/r/777332 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[10:04:03] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp5007 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:04:43] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp5001 as cache::upload_haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777321 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:05:29] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp4035 as cache::text_haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777322 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:05:39] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected https://puppet-compiler.wmflabs.org/pcc-worker1002/34699/" [puppet] - https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[10:05:59] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp3052 as cache::text_haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777323 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:06:50] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp4027 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777324 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:06:56] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudgw1001: use a custom name for the dataplane NIC [puppet] - https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598)
[10:07:27] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp4033 as cache::upload_haproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777325 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:07:57] <wikibugs>	 (CR) Vgutierrez: [C: +1] site: Reimage cp4021 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777326 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:10:02] <wikibugs>	 (CR) Volans: [C: +2] homer: adjust the daily diff start time [puppet] - https://gerrit.wikimedia.org/r/777300 (owner: Volans)
[10:12:25] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:13:12] <wikibugs>	 (CR) Btullis: "The change looks good, but I am confused by the statement:" [puppet] - https://gerrit.wikimedia.org/r/777292 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:14:11] <wikibugs>	 (CR) David Caro: [C: +1] "Got a question about how it gets applied, not really the patch itself." [puppet] - https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[10:17:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24107 and previous config saved to /var/cache/conftool/dbconfig/20220405-101709-ladsgroup.json
[10:17:10] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw1001: use a custom name for the dataplane NIC (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777333 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[10:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:13] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[10:18:06] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[10:18:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:49] <wikibugs>	 (PS1) Volans: prometheus: fix typo in docstring [software/pywmflib] - https://gerrit.wikimedia.org/r/777334
[10:19:58] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:35] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: conntrackd: refresh NIC name [puppet] - https://gerrit.wikimedia.org/r/777335 (https://phabricator.wikimedia.org/T304598)
[10:23:18] <wikibugs>	 (CR) Volans: [C: +2] prometheus: fix typo in docstring [software/pywmflib] - https://gerrit.wikimedia.org/r/777334 (owner: Volans)
[10:23:33] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: conntrackd: refresh NIC name [puppet] - https://gerrit.wikimedia.org/r/777335 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[10:25:37] <wikibugs>	 (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34700/netflow6001.drmrs.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/777292 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:26:21] <wikibugs>	 (Merged) jenkins-bot: prometheus: fix typo in docstring [software/pywmflib] - https://gerrit.wikimedia.org/r/777334 (owner: Volans)
[10:30:09] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudgw1001.eqiad.wmnet with OS bullseye
[10:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:40] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye
[10:30:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:48] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[10:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24108 and previous config saved to /var/cache/conftool/dbconfig/20220405-103214-ladsgroup.json
[10:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[10:38:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:39:01] <wikibugs>	 (CR) Volans: [C: +2] sre.SREBatchBase: additional customizations [cookbooks] - https://gerrit.wikimedia.org/r/776965 (owner: Volans)
[10:39:08] <wikibugs>	 (CR) Volans: [C: +2] sre.cdn.roll-restart-varnish: improvements [cookbooks] - https://gerrit.wikimedia.org/r/776966 (owner: Volans)
[10:42:11] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage
[10:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:21] <wikibugs>	 (Merged) jenkins-bot: sre.SREBatchBase: additional customizations [cookbooks] - https://gerrit.wikimedia.org/r/776965 (owner: Volans)
[10:42:24] <wikibugs>	 (Merged) jenkins-bot: sre.cdn.roll-restart-varnish: improvements [cookbooks] - https://gerrit.wikimedia.org/r/776966 (owner: Volans)
[10:45:03] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage
[10:45:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24109 and previous config saved to /var/cache/conftool/dbconfig/20220405-104719-ladsgroup.json
[10:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:44] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "gerrit cannot rebase this one, please rebase by hand." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749711 (owner: Majavah)
[10:47:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ayounsi) Noted! To keep track of the IRC conversation, echoing it here: > is that a hard blocker? or could it be fixed before those hosts are l...
[10:55:12] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1001.eqiad.wmnet with OS bullseye
[10:55:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:44] <wikibugs>	 (PS3) Jgiannelos: maps: Re-enable OSM sync for on eqiad master [puppet] - https://gerrit.wikimedia.org/r/772453 (https://phabricator.wikimedia.org/T304984)
[10:56:10] <icinga-wm>	 PROBLEM - SSH on wtp1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:56:21] <volans>	 !log installed spicerack v2.4.0 on the cumin hosts
[10:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:37] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:03:37] <icinga-wm>	 PROBLEM - Host tools.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100%
[11:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24110 and previous config saved to /var/cache/conftool/dbconfig/20220405-110224-ladsgroup.json
[11:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[11:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[11:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24111 and previous config saved to /var/cache/conftool/dbconfig/20220405-110232-ladsgroup.json
[11:03:37] <icinga-wm>	 RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[11:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:38] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[11:03:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (MoritzMuehlenhoff) @Aitolkyn  Can you please sign https://phabricator.wikimedia.org/L3 ? Then we're good to go.
[11:05:26] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Allow wikikube staging pod range to access kafka eqiad-test cluster [puppet] - https://gerrit.wikimedia.org/r/777314 (https://phabricator.wikimedia.org/T303049) (owner: Btullis)
[11:06:11] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[11:06:11] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudgw1001.eqiad.wmnet
[11:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:24] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[11:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:31] <wikibugs>	 (PS1) Muehlenhoff: Enable access for Paramita Das [puppet] - https://gerrit.wikimedia.org/r/777339
[11:10:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable access for Paramita Das [puppet] - https://gerrit.wikimedia.org/r/777339 (owner: Muehlenhoff)
[11:10:13] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[11:10:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:17] <wikibugs>	 (Abandoned) MMandere: site: Reimage cp5002 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776871 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[11:11:25] <wikibugs>	 (PS2) Muehlenhoff: Enable access for Paramita Das [puppet] - https://gerrit.wikimedia.org/r/777339 (https://phabricator.wikimedia.org/T305298)
[11:11:27] <wikibugs>	 (PS1) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:12:05] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:49] <wikibugs>	 (Abandoned) MMandere: site: Reimage cp6007 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776872 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[11:13:13] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable access for Paramita Das [puppet] - https://gerrit.wikimedia.org/r/777339 (https://phabricator.wikimedia.org/T305298) (owner: Muehlenhoff)
[11:15:05] <mmandere>	 !log depool cp5015 for reimage - T290005
[11:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:08] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:15:11] <wikibugs>	 (PS2) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:16:36] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp5015 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777317 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[11:18:29] <wikibugs>	 (PS3) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:20:42] <wikibugs>	 (PS4) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:21:28] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34704/console" [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[11:23:55] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5015.eqsin.wmnet with OS buster
[11:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:04] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5015.eqsin.wmnet with OS buster
[11:25:32] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:25:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (Aitolkyn) >>! In T305299#7831501, @MoritzMuehlenhoff wrote: > @Aitolkyn  Can you please sign https://phabricator.wikimedia.org/L3 ? Then we're good to go.  @MoritzMu...
[11:30:45] <wikibugs>	 (PS5) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:31:17] <mmandere>	 !log depool cp6007 for reimage - T290005
[11:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:20] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:32:29] <wikibugs>	 (PS6) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:32:52] <wikibugs>	 (PS2) MMandere: site: Reimage cp6007 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005)
[11:33:34] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34705/console" [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[11:34:23] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp6007 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/777318 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[11:34:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @paramita_das: Your access has been enabled and you should have received an email with instructi...
[11:36:17] <wikibugs>	 (PS7) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:37:34] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: move backups to /mnt/gitlab-backup [puppet] - https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[11:37:41] <wikibugs>	 (PS5) Jelto: gitlab: move backups to /mnt/gitlab-backup [puppet] - https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463)
[11:38:20] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (MoritzMuehlenhoff) And please note that your username is "paramd" (the UID you used when creating the account on wikitech).
[11:38:29] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6007.drmrs.wmnet with OS buster
[11:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:38] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster
[11:38:43] <wikibugs>	 (PS1) Muehlenhoff: Enable access for aitolkyn [puppet] - https://gerrit.wikimedia.org/r/777342
[11:39:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 T305427', diff saved to https://phabricator.wikimedia.org/P24112 and previous config saved to /var/cache/conftool/dbconfig/20220405-113944-root.json
[11:39:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:48] <stashbot>	 T305427: Slow DB query on 10.6 - https://phabricator.wikimedia.org/T305427
[11:41:19] <wikibugs>	 (PS8) Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247)
[11:42:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable access for aitolkyn [puppet] - https://gerrit.wikimedia.org/r/777342 (owner: Muehlenhoff)
[11:45:30] <logmsgbot>	 !log jnuche@deploy1002 Started scap: resync wmf.6 to reapply security patches - T305212
[11:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:33] <stashbot>	 T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212
[11:47:15] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: host reimage
[11:47:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:20] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: resync wmf.6 to reapply security patches - T305212 (duration: 02m 50s)
[11:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @Aitolkyn  Your access has been enabled and you should have received an email with instructions how...
[11:49:40] <wikibugs>	 (PS1) Jaime Nuche: group0 wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777344
[11:49:42] <wikibugs>	 (CR) Jaime Nuche: [C: +2] group0 wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777344 (owner: Jaime Nuche)
[11:50:24] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.6  refs T305212 [mediawiki-config] - https://gerrit.wikimedia.org/r/777344 (owner: Jaime Nuche)
[11:50:26] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5015.eqsin.wmnet with reason: host reimage
[11:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:51] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build/deplo code into a manager class [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/774459
[11:52:10] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.6  refs T305212
[11:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:13] <stashbot>	 T305212: 1.39.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T305212
[11:53:47] <wikibugs>	 (PS1) Jelto: gitlab: add backup and restore intervals to cloud hiera [puppet] - https://gerrit.wikimedia.org/r/777345 (https://phabricator.wikimedia.org/T274463)
[11:53:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:53:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:54:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:54:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: toolforge: k8s: factorize build/deplo code into a manager class [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/774459 (owner: Arturo Borrero Gonzalez)
[11:56:08] <icinga-wm>	 RECOVERY - SSH on wtp1041.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:56:39] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage
[11:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:46] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34706/console" [puppet] - https://gerrit.wikimedia.org/r/777345 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[11:58:10] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: add backup and restore intervals to cloud hiera [puppet] - https://gerrit.wikimedia.org/r/777345 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[12:28:48] <wikibugs>	 SRE, SRE Observability (FY2021/2022-Q4): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (lmata)
[12:32:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24115 and previous config saved to /var/cache/conftool/dbconfig/20220405-123227-ladsgroup.json
[12:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:54] <wikibugs>	 (03PS2) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860)
[12:39:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: WIP move core routers definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860)
[12:40:34] <mmandere>	 !log pool cp5015 with HAProxy as TLS termination layer - T290005
[12:40:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:37] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[12:41:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[12:41:16] <wikibugs>	 (03PS1) 10JMeybohm: Copy all helmfile-defaults to each subchart namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454)
[12:42:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1085.eqiad.wmnet
[12:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:32] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:43:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:56] <wikibugs>	 (03PS3) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860)
[12:46:03] <mmandere>	 !log pool cp6007 with HAProxy as TLS termination layer - T290005
[12:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:06] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[12:46:51] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) >>! In T300977#7821702, @Isaac wrote: > Chiming in as a heavy user of the stat boxes. It's difficult for me to fo...
[12:47:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24116 and previous config saved to /var/cache/conftool/dbconfig/20220405-124732-ladsgroup.json
[12:47:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[12:47:35] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[12:47:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[12:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:47:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24117 and previous config saved to /var/cache/conftool/dbconfig/20220405-124745-ladsgroup.json
[12:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:50:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:50:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:21] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1085.eqiad.wmnet
[12:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:39] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3064.esams.wmnet
[12:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:39] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "This looks excellent. Many thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454) (owner: 10JMeybohm)
[12:58:38] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Copy all helmfile-defaults to each subchart namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454) (owner: 10JMeybohm)
[12:59:19] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet
[12:59:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:22] <wikibugs>	 10SRE, 10Performance-Team: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff)
[13:01:17] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet
[13:01:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:26] <wikibugs>	 (03Merged) 10jenkins-bot: Copy all helmfile-defaults to each subchart namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/777348 (https://phabricator.wikimedia.org/T301454) (owner: 10JMeybohm)
[13:03:35] <moritzm>	 !log installing openssl updates from buster 10.12 point release
[13:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3064.esams.wmnet
[13:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:07:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4030.ulsfo.wmnet
[13:07:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:11:18] <zabe>	 I am a bit late for the window, but is there someone who can deploy two config patches for me?
[13:11:50] <taavi>	 zabe: hey, I'm around
[13:12:25] <taavi>	 looking
[13:12:25] <zabe>	 taavi, thx. I added them to the calendar.
[13:13:00] <wikibugs>	 (03PS3) 10Majavah: Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[13:13:03] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[13:13:53] <wikibugs>	 (03Merged) 10jenkins-bot: Pin CheckUser actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776250 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[13:14:37] <taavi>	 zabe: pulled to mwdebug1001, is there anything to test?
[13:14:53] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4030.ulsfo.wmnet
[13:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:25] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis)
[13:15:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Advertised RSS/Atom feeds for wikimediastatus.net don't work - https://phabricator.wikimedia.org/T305174 (10CDanis) 05Open→03Resolved a:03CDanis
[13:15:50] <zabe>	 taavi, not really. nothing seems to break on testwiki, that's all I can really test
[13:16:02] <taavi>	 ok, I'll just sync then
[13:16:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:02] <logmsgbot>	 !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:776250|Pin CheckUser actor migration to old schema (T233004)]] (duration: 00m 54s)
[13:17:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:04] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[13:17:19] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776257 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:17:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt2001.wikimedia.org
[13:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:30] <wikibugs>	 (03Merged) 10jenkins-bot: Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776257 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:18:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:06] <taavi>	 zabe: pulled the second too, I assume this too is not testable?
[13:19:47] <zabe>	 taavi, yep
[13:19:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4003.wikimedia.org
[13:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:59] <taavi>	 ok, syncing
[13:20:03] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) @BTullis > I realize that this suggestion increases the scope if the task considerably yup :) We unfortunately do...
[13:20:55] <logmsgbot>	 !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:776257|Start writing to $wmgUdp2logDest the same value as to $wmfUdp2logDest (T45956)]] (duration: 00m 54s)
[13:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:58] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[13:21:04] <taavi>	 ok, done
[13:21:18] <jinxer-wm>	 (ProbeDown) firing: Service apt:80 has failed probes (http_apt_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:21:28] <zabe>	 thanks
[13:21:28] <taavi>	 umh
[13:21:34] <taavi>	 I don't think that alert is related
[13:21:36] <godog>	 uuhh, investigating
[13:21:39] <godog>	 not related no
[13:21:48] * Emperor here
[13:21:59] * volans here
[13:22:08] <taavi>	 looks like caused by apt2001 restart?
[13:22:11] <moritzm>	 might be the apt2001 reboot? but why is that suddenly alerting?
[13:22:24] <vgutierrez>	 new paging probes :)
[13:22:43] <XioNoX>	 is it possible to know the full url of the service being unreachable?
[13:23:01] <Emperor>	 apt2001 uptime 2 mins
[13:23:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4003.wikimedia.org
[13:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:36] <moritzm>	 yes, apt2001 was intentionally rebooted along with downtime etc. and it's also the passive host
[13:23:43] <moritzm>	 so that check should not alert at all
[13:23:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:23:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2001.wikimedia.org
[13:23:45] <godog>	 XioNoX: yes, it is in the logs
[13:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:52] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kubestagemaster2001.codfw.wmnet with reason: reimage
[13:23:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:55] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubestagemaster2001.codfw.wmnet with reason: reimage
[13:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:12] <XioNoX>	 godog: where?
[13:24:15] <Emperor>	 moritzm: I'll go ACK the VO alert then
[13:24:23] <moritzm>	 Emperor: ack, thx
[13:25:14] <godog>	 XioNoX: there's a link at https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown but I'll be linking the logs from the grafana dashboard too
[13:25:16] <moritzm>	 also apt.wikimedia.org being down has no user-visible impact at all (except that Puppet runs will get stalled/failed), there's no reason it should page to begin with
[13:25:45] <godog>	 moritzm: ok to set page: false in service::catalog for it then I guess ?
[13:26:18] <jinxer-wm>	 (ProbeDown) resolved: Service apt:80 has failed probes (http_apt_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:27:35] <moritzm>	 godog: yeah, let's make a patch and ask for people's opinions? from my POV there's no need to make it page, but would like to have a second opinion at least
[13:27:52] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:28:14] <godog>	 moritzm: ack, yeah I tend to agree there's not really a paging need
[13:28:23] <kormat>	 godog: looking at the logs dashboard linked from that wiki page, i don't see a way to spot the issue with apt
[13:28:41] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm)
[13:28:57] <wikibugs>	 (03PS2) 10JMeybohm: Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777309 (https://phabricator.wikimedia.org/T305435)
[13:29:27] <Emperor>	 kormat: I fished the IP out of the email from VO
[13:30:13] <kormat>	 Emperor: i don't think i got an email?
[13:30:15] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Triagers, 10acl*phabricator: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10lmata)
[13:30:30] <Emperor>	 kormat: ah, VO pages me by email (at least initially)
[13:30:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:46] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Triagers, and 2 others: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10lmata)
[13:31:31] <godog>	 kormat: that's fair yeah, the unfiltered logs dashboard is dense, the alert (in the UI) does have a link to a filtered view (i.e. with 'service.name'), I'll be working to make it more obvious when something fails in the logs
[13:31:35] <moritzm>	 godog: ack, I'll prepare a patch later
[13:31:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1086.eqiad.wmnet
[13:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:51] <wikibugs>	 (03PS1) 10JMeybohm: Revert "Move kubestagemaster* to bullseye and upstream docker" [puppet] - 10https://gerrit.wikimedia.org/r/777016
[13:32:30] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:32:51] <wikibugs>	 (03PS1) 10Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - 10https://gerrit.wikimedia.org/r/777353
[13:32:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Triagers, and 3 others: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup {{done}}
[13:33:24] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "Move kubestagemaster* to bullseye and upstream docker" [puppet] - 10https://gerrit.wikimedia.org/r/777016 (owner: 10JMeybohm)
[13:33:26] <wikibugs>	 (03PS2) 10Volans: sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434)
[13:34:01] <godog>	 well, at least the paging/probing works as expected
[13:34:06] <wikibugs>	 (03PS2) 10JMeybohm: Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435)
[13:35:48] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Move kubemaster2002 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/777310 (https://phabricator.wikimedia.org/T305435) (owner: 10JMeybohm)
[13:36:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:36:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:58] <wikibugs>	 (03PS1) 10JMeybohm: Revert "Move kubemaster2002 to bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/777017
[13:36:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:37:40] <wikibugs>	 10SRE, 10Phabricator, 10SRE Observability (FY2021/2022-Q4), 10User-Ladsgroup: SRE access request to join #triagers for user lmata - https://phabricator.wikimedia.org/T305463 (10Majavah)
[13:38:44] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "Move kubemaster2002 to bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/777017 (owner: 10JMeybohm)
[13:39:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
[13:39:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host deneb.codfw.wmnet
[13:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:00] <wikibugs>	 (03PS1) 10JMeybohm: Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777018
[13:41:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Boldly doing so. Reopen if we get it again.
[13:41:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
[13:41:33] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 6 hosts with reason: Cluster re-init for new IP ranges
[13:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:38] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: Cluster re-init for new IP ranges
[13:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:06] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Move kubestagemaster* to bullseye and upstream docker [puppet] - 10https://gerrit.wikimedia.org/r/777018 (owner: 10JMeybohm)
[13:42:10] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3004.wikimedia.org
[13:43:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:48] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:44:57] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp1086.eqiad.wmnet
[13:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:16] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans)
[13:45:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deneb.codfw.wmnet
[13:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:08] <icinga-wm>	 PROBLEM - Check systemd state on cp1086 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens2f0np0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24119 and previous config saved to /var/cache/conftool/dbconfig/20220405-134801-ladsgroup.json
[13:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:04] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[13:48:51] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans)
[13:49:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3004.wikimedia.org
[13:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:57] <wikibugs>	 (03PS3) 10Ladsgroup: dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670)
[13:50:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dbtools: Port switchover-tmpl to python [software] - 10https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: 10Ladsgroup)
[13:52:47] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Add --dbgroupdefault=dump to every major dump run [dumps] - 10https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup)
[13:53:15] <wikibugs>	 (03Merged) 10jenkins-bot: Add --dbgroupdefault=dump to every major dump run [dumps] - 10https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup)
[13:53:30] <icinga-wm>	 PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:53:38] <Amir1>	 o/
[13:58:33] <mmandere>	 !log depool cp5013 for reimage - T290005
[13:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:36] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[13:59:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 129 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:00:28] <Amir1>	 jouncebot: nowandnext
[14:00:28] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 59 minute(s)
[14:00:28] <jouncebot>	 In 1 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1600)
[14:01:10] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:01:56] <wikibugs>	 (03PS2) 10Ladsgroup: Enable videojs on all of DIP wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775294 (https://phabricator.wikimedia.org/T248418)
[14:03:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24120 and previous config saved to /var/cache/conftool/dbconfig/20220405-140306-ladsgroup.json
[14:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:20] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Enable videojs on all of DIP wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775294 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup)
[14:03:48] <wikibugs>	 (03PS2) 10MMandere: site: Reimage cp5013 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005)
[14:04:10] <wikibugs>	 (03Merged) 10jenkins-bot: Enable videojs on all of DIP wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775294 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup)
[14:05:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:25] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] site: Reimage cp5013 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777319 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere)
[14:05:28] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:775294|Enable videojs on all of DIP wikis (T248418)]] (duration: 00m 53s)
[14:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:31] <stashbot>	 T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418
[14:07:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:08:19] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5013.eqsin.wmnet with OS buster
[14:08:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:28] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5013.eqsin.wmnet with OS buster
[14:08:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Don't make apt.wikimedia.org page [puppet] - 10https://gerrit.wikimedia.org/r/777357
[14:09:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:12:05] <mmandere>	 !log depool cp5007 for reimage - T290005
[14:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:07] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[14:12:25] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:12:30] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/777357 (owner: Muehlenhoff)
[14:13:12] <wikibugs>	 (PS4) Giuseppe Lavagetto: cache::base: add check to netmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471)
[14:14:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:14:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] httpbb: Delete the git::clone and install via deb package [puppet] - https://gerrit.wikimedia.org/r/776977 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[14:15:02] <wikibugs>	 (CR) CDanis: [C: +1] cache::base: add check to netmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[14:15:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:35] <wikibugs>	 (PS2) MMandere: site: Reimage cp5007 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005)
[14:16:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[14:17:47] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp5007 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/777320 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[14:18:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24121 and previous config saved to /var/cache/conftool/dbconfig/20220405-141811-ladsgroup.json
[14:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:24] <wikibugs>	 (CR) Jcrespo: [C: +2] check: Read list of valid sections/valid backup jobs from a file [software/wmfbackups] - https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[14:18:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:13] <wikibugs>	 (PS2) Muehlenhoff: Don't make apt.wikimedia.org page [puppet] - https://gerrit.wikimedia.org/r/777357
[14:19:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:21:26] <wikibugs>	 (PS2) Giuseppe Lavagetto: varnish::frontend: remove temporary rate-limits [puppet] - https://gerrit.wikimedia.org/r/773454
[14:21:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] cache::base: add check to netmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[14:21:49] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/777357 (owner: Muehlenhoff)
[14:21:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[14:21:58] <icinga-wm>	 PROBLEM - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: connect to address inference.discovery.wmnet and port 30443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:22:11] <elukey>	 klausman: --^
[14:22:15] <wikibugs>	 ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[14:22:26] <wikibugs>	 ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) p:Triage→Medium
[14:22:30] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:22:46] <wikibugs>	 (PS1) JMeybohm: Move kubemaster2002 to bullseye [puppet] - https://gerrit.wikimedia.org/r/777021
[14:22:49] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5007.eqsin.wmnet with OS buster
[14:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:55] <elukey>	 the LVS alarm is fine, we have stopped the ml eqiad cluster
[14:22:58] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5007.eqsin.wmnet with OS buster
[14:24:00] <klausman>	 elukey: I can't find that host in Icinga
[14:24:38] <klausman>	 that is... codfw
[14:24:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:26:12] <_joe_>	 klausman: AFAICT, your alerting was actually reaching eqiad and not codfw
[14:26:15] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:26:42] <elukey>	 _joe_ it may be misleading, it says that inference.discovery.wmnet is not reachable 
[14:26:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:27:15] <elukey>	 okok we are going to ack these
[14:28:13] <_joe_>	 curl https://inference.svc.codfw.wmnet:30443/ returns 204, so I assume the alert is wrong, but we can check later 
[14:28:22] <elukey>	 sure sure 
[14:28:35] * elukey takes notes
[14:28:53] <wikibugs>	 (PS3) Giuseppe Lavagetto: varnish::frontend: remove temporary rate-limits [puppet] - https://gerrit.wikimedia.org/r/773454
[14:28:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:29:05] <wikibugs>	 (PS1) Jelto: gitlab: fix duplicate backup_dir hiera key [puppet] - https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463)
[14:30:32] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2014.codfw.wmnet with reason: Rebooting for T303174
[14:30:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:34] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc2014.codfw.wmnet with reason: Rebooting for T303174
[14:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:27] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: host reimage
[14:31:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:39] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kubestagemaster1001.eqiad.wmnet with reason: reimage
[14:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:41] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubestagemaster1001.eqiad.wmnet with reason: reimage
[14:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] varnish::frontend: remove temporary rate-limits (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773454 (owner: Giuseppe Lavagetto)
[14:31:59] <logmsgbot>	 !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: host reimage
[14:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:08] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34707/console" [puppet] - https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[14:33:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24122 and previous config saved to /var/cache/conftool/dbconfig/20220405-143316-ladsgroup.json
[14:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:21] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[14:33:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[14:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[14:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance
[14:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance
[14:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:54] <icinga-wm>	 ACKNOWLEDGEMENT - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: connect to address inference.discovery.wmnet and port 30443: No route to host Klausman Cluster re-init for new IP ranges (T304673) https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:33:54] <icinga-wm>	 ACKNOWLEDGEMENT - LVS inference eqiad port 30443/tcp - Inference ML service IPv4 on inference.svc.eqiad.wmnet is CRITICAL: connect to address inference.discovery.wmnet and port 30443: No route to host Klausman Cluster re-init for new IP ranges (T304673) https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:33:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:34:28] <wikibugs>	 (CR) Elukey: [C: +2] role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[14:34:36] <wikibugs>	 (PS3) Elukey: role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673)
[14:35:04] <wikibugs>	 (CR) Elukey: [C: +2] Change the Calico's pod IP subnet for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/776876 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[14:36:15] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:29] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2011.codfw.wmnet with reason: Rebooting for T303174
[14:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:30] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc2011.codfw.wmnet with reason: Rebooting for T303174
[14:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:20] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01286 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[14:38:04] <wikibugs>	 (CR) Ladsgroup: dbtools: Port switchover-tmpl to python (1 comment) [software] - https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: Ladsgroup)
[14:38:49] <wikibugs>	 (CR) Kormat: [C: +1] dbtools: Port switchover-tmpl to python [software] - https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670) (owner: Ladsgroup)
[14:38:54] <volans>	 moritzm: did you restart any of the puppetmasters by any chance? ^^^ Widespread puppet agent failures
[14:40:07] <wikibugs>	 (PS4) Ladsgroup: dbtools: Port switchover-tmpl to python [software] - https://gerrit.wikimedia.org/r/776241 (https://phabricator.wikimedia.org/T304670)
[14:41:15] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:54] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! Thanks Moritz" [puppet] - https://gerrit.wikimedia.org/r/777357 (owner: Muehlenhoff)
[14:42:28] <icinga-wm>	 RECOVERY - Check systemd state on cp1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:00] <vgutierrez>	 !log re-pool cp1086
[14:44:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:14] <moritzm>	 volans: kind of, I upgraded mod-auth-cas (which is installed on the Puppet master frontends for config-master.w.o) and that involves an Apache restart
[14:44:44] <moritzm>	 should recover soonish
[14:44:45] <volans>	 ack
[14:44:47] <volans>	 thx
[14:45:38] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-ctrl_6443: Servers ml-serve-ctrl1002.eqiad.wmnet are marked down but pooled: inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:45:54] <wikibugs>	 (PS2) Giuseppe Lavagetto: varnish::frontend: remove normalization for parameter [puppet] - https://gerrit.wikimedia.org/r/773455
[14:46:15] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:46:47] <wikibugs>	 (PS1) Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - https://gerrit.wikimedia.org/r/777363
[14:47:10] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5007.eqsin.wmnet with reason: host reimage
[14:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:47:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (6) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:48:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] varnish::frontend: remove normalization for parameter [puppet] - https://gerrit.wikimedia.org/r/773455 (owner: Giuseppe Lavagetto)
[14:48:18] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1005.mgmt.eqiad.wmnet with reboot policy FORCED
[14:48:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:36] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2012.codfw.wmnet with reason: Rebooting for T303174
[14:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:38] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc2012.codfw.wmnet with reason: Rebooting for T303174
[14:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:04] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1006.mgmt.eqiad.wmnet with reboot policy FORCED
[14:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans)
[14:49:34] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-cache1001.mgmt.eqiad.wmnet with reboot policy FORCED
[14:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:50:06] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5007.eqsin.wmnet with reason: host reimage
[14:50:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4036.ulsfo.wmnet
[14:50:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:11] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1007.mgmt.eqiad.wmnet with reboot policy FORCED
[14:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:29] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-cache1002.mgmt.eqiad.wmnet with reboot policy FORCED
[14:50:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1087.eqiad.wmnet
[14:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-cache1003.mgmt.eqiad.wmnet with reboot policy FORCED
[14:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3065.esams.wmnet
[14:51:03] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-serve1008.mgmt.eqiad.wmnet with reboot policy FORCED
[14:51:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson) @elukey closed
[14:55:22] <wikibugs>	 (PS4) Jelto: gitlab_runner: override ExecStart in service unit for non-root [puppet] - https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481)
[14:55:45] <wikibugs>	 (PS1) JMeybohm: Don't schedule calico kube-controllers on master nodes [deployment-charts] - https://gerrit.wikimedia.org/r/777364
[14:56:19] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc2013.codfw.wmnet with reason: Rebooting for T303174
[14:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:20] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc2013.codfw.wmnet with reason: Rebooting for T303174
[14:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:29] <wikibugs>	 (CR) Jelto: [C: +2] gitlab_runner: override ExecStart in service unit for non-root [puppet] - https://gerrit.wikimedia.org/r/775821 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[15:00:34] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5013.eqsin.wmnet with OS buster
[15:00:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:43] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5013.eqsin.wmnet with OS buster com...
[15:01:01] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Taking host offline to upgrade to Bullseye
[15:01:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:03] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Taking host offline to upgrade to Bullseye
[15:01:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:47] <wikibugs>	 (PS1) Btullis: Update the chart to address issues with secrets and CI [deployment-charts] - https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454)
[15:02:30] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:54] <mmandere>	 !log pool cp5013 with HAProxy as TLS termination layer - T290005
[15:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:01] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[15:03:43] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2020.codfw.wmnet with reason: Rebooting for T303174
[15:03:45] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2020.codfw.wmnet with reason: Rebooting for T303174
[15:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:30] <wikibugs>	 (CR) Herron: [C: +2] logstash: set unit TimeoutStopSec of 2 minutes [puppet] - https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: Herron)
[15:09:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:10:32] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1003.eqiad.wmnet with OS bullseye
[15:10:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:36] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:02] <icinga-wm>	 PROBLEM - purged service on cp4036 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:11:15] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1087.eqiad.wmnet
[15:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3065.esams.wmnet
[15:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:36] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2022.codfw.wmnet with reason: Rebooting for T303174
[15:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:38] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2022.codfw.wmnet with reason: Rebooting for T303174
[15:11:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:00] <moritzm>	 !log installing atftp security updates
[15:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:22] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[15:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:29] <wikibugs>	 (PS1) Alexandros Kosiaris: mobileapps: Increase CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/777369 (https://phabricator.wikimedia.org/T305482)
[15:15:18] <jinxer-wm>	 (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bullseye
[15:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:40] <rzl>	 👋
[15:15:41] <godog>	 ooof mmhh
[15:15:46] * Emperor is here
[15:15:57] <_joe_>	 looks like wdqs is down in eqiad?
[15:16:24] <godog>	 indeed
[15:16:27] <_joe_>	 godog: do we have the url that is probed somewhere?
[15:16:56] <Emperor>	 wdqs-ssl:443 ; I think what that means is defined in puppet?
[15:17:08] <godog>	 _joe_: yes, I am fixing the logs dashboard, in the meantime https://logstash.wikimedia.org/goto/f03a5816909ab0f402ec801c17aca444
[15:17:15] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mobileapps: Increase CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/777369 (https://phabricator.wikimedia.org/T305482) (owner: Alexandros Kosiaris)
[15:17:22] <_joe_>	 lvs sees everything as up
[15:17:27] <godog>	 i.e. /readiness-probe
[15:17:33] <wikibugs>	 (PS2) Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - https://gerrit.wikimedia.org/r/777353
[15:17:47] <Emperor>	 eqiad host is 10.2.2.32
[15:18:30] <godog>	 yeah looks like it was quite a short blip heh
[15:19:06] <godog>	 clearly the alert is too trigger happy, my apologies for the page folks
[15:19:14] <_joe_>	 yeah, also np
[15:19:20] <wikibugs>	 (PS1) Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - https://gerrit.wikimedia.org/r/777370
[15:19:22] <wikibugs>	 (CR) Jaime Nuche: [C: +2] test [mediawiki-config] (sandbox/jnuche) - https://gerrit.wikimedia.org/r/777370 (owner: Jaime Nuche)
[15:19:26] <_joe_>	 that's why we tend to fine-tune alerts when they are introduced
[15:19:30] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5007.eqsin.wmnet with OS buster
[15:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:40] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5007.eqsin.wmnet with OS buster com...
[15:19:50] <godog>	 indeed
[15:20:13] <rzl>	 would love if we could also get the rule or at least the DC into the alert message
[15:20:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[15:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[15:20:18] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:20] <rzl>	 er, could get the *URL
[15:20:21] <rzl>	 too early :)
[15:20:25] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1003.eqiad.wmnet with reason: host reimage
[15:20:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:34] <wikibugs>	 (PS1) Ladsgroup: ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993)
[15:20:46] <Amir1>	 jouncebot: nowandnext
[15:20:46] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[15:20:46] <jouncebot>	 In 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1600)
[15:20:50] <Amir1>	 noice
[15:20:55] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2021.codfw.wmnet with reason: Rebooting for T303174
[15:20:56] <godog>	 ok sending a patch for a longer 'for' clause
[15:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:57] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2021.codfw.wmnet with reason: Rebooting for T303174
[15:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:22] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002675 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:21:31] <wikibugs>	 (CR) Ladsgroup: [C: +2] ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993) (owner: Ladsgroup)
[15:22:01] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Increase CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/777369 (https://phabricator.wikimedia.org/T305482) (owner: Alexandros Kosiaris)
[15:22:12] <wikibugs>	 (Abandoned) Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - https://gerrit.wikimedia.org/r/777363 (owner: Jaime Nuche)
[15:22:15] <_joe_>	 godog: I am looking at the logs from lvs and indeed it seems the server that caused the page was depooled at 15:15:19
[15:22:28] <_joe_>	 pybal does the same query as your probe
[15:23:02] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[15:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:03] <_joe_>	 so I would be surprised if we'd end up with paging if you had even one retry before we run out of backends that are healthy according to pybal
[15:23:15] <mmandere>	 !log pool cp5007 with HAProxy as TLS termination layer - T290005
[15:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:21] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[15:23:23] <_joe_>	 else it means pybal
[15:23:27] <icinga-wm>	 RECOVERY - purged service on cp4036 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:23:32] <_joe_>	 is severely overloaded and not checking enough
[15:23:43] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1088.eqiad.wmnet
[15:23:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:45] <godog>	 _joe_: yeah definitely 1m is too short, sending review for 2m now
[15:25:01] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1003.eqiad.wmnet with reason: host reimage
[15:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:04] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:07] <godog>	 or we can keep 1m but lower availability (i.e. failed probes)
[15:25:25] <_joe_>	 godog: I would go with multiple failed probes before paging
[15:25:27] <_joe_>	 as in
[15:25:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:46] <_joe_>	 if we fail for 3 tries in a row 
[15:26:00] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:01] <_joe_>	 sadly if we say 2m but then have just one datapoint in that interval
[15:26:26] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:39] <godog>	 in this case the probes are sent every 15s though, unlike icinga
[15:26:51] <godog>	 from two prometheus hosts in codfw/eqiad
[15:27:05] <godog>	 going with lower availability is valid too
[15:27:21] <godog>	 git review isn't cooperating
[15:27:59] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4036.ulsfo.wmnet
[15:27:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[15:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:07] <wikibugs>	 (PS1) Filippo Giunchedi: sre: page after 2m of < 75% avail for network probes [alerts] - https://gerrit.wikimedia.org/r/777371
[15:28:36] <godog>	 or more retries, i.e. 1m but say 60% avail
[15:28:57] <wikibugs>	 SRE, WMF-General-or-Unknown, WMF-Legal, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (Ladsgroup) If we can get an explicit approval by legal to license all contributions of wikimedia.org email addresses to apache 2.0. I can start ma...
[15:31:11] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1002.eqiad.wmnet with OS bullseye
[15:31:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:20] <wikibugs>	 (PS2) Zabe: Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956)
[15:31:24] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[15:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:37] <godog>	 ok going with https://gerrit.wikimedia.org/r/c/operations/alerts/+/777371 for now, unless there are objections
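The trade-off discussed above (a longer 'for' clause versus a lower availability threshold, with probes every 15s) can be illustrated with a minimal sketch. It assumes the rule is evaluated once per 15s probe round and that the alert pages only when availability stays below the threshold continuously for the whole 'for' duration; the function name, the evaluation interval, and the sample values are hypothetical and are not taken from the actual alert rule in the patch.

```python
from typing import Iterable

# Assumption: one rule evaluation per probe round (probes are sent every 15s).
EVAL_INTERVAL_S = 15


def would_page(avail: Iterable[float], threshold: float, for_s: int) -> bool:
    """Return True if availability stays below `threshold` for at least `for_s` seconds.

    `avail` is a hypothetical series of availability values (0.0 to 1.0),
    one per evaluation. A single healthy evaluation resets the pending state.
    """
    pending_since = None  # index of the first evaluation in the current bad run
    for i, value in enumerate(avail):
        if value < threshold:
            if pending_since is None:
                pending_since = i
            if (i - pending_since) * EVAL_INTERVAL_S >= for_s:
                return True
        else:
            pending_since = None
    return False


# Made-up data: a blip of six bad evaluations (75s from first to last).
blip = [1.0] * 4 + [0.5] * 6 + [1.0] * 4

print(would_page(blip, threshold=0.75, for_s=60))   # True: the blip outlasts 1m
print(would_page(blip, threshold=0.75, for_s=120))  # False: the blip ends before 2m
```

With this made-up series, a 1m 'for' clause pages on the blip while a 2m one does not; lowering the threshold instead (for example to 0.4) would also suppress the page, since 0.5 is no longer below it.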
[15:31:40] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bullseye
[15:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:51] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1088.eqiad.wmnet
[15:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bullseye
[15:31:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1089.eqiad.wmnet
[15:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:35:39] <wikibugs>	 SRE, WMF-General-or-Unknown, WMF-Legal, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (MoritzMuehlenhoff) >>! In T67270#7832416, @Ladsgroup wrote: > If we can get an explicit approval by legal to license all contributions of wikimedi...
[15:37:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: page after 2m of < 75% avail for network probes [alerts] - https://gerrit.wikimedia.org/r/777371 (owner: Filippo Giunchedi)
[15:39:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993) (owner: Ladsgroup)
[15:39:12] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1003.eqiad.wmnet with OS bullseye
[15:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:54] <wikibugs>	 (PS1) Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - https://gerrit.wikimedia.org/r/777373
[15:39:56] <wikibugs>	 (CR) Jaime Nuche: [C: +2] test [mediawiki-config] (sandbox/jnuche) - https://gerrit.wikimedia.org/r/777373 (owner: Jaime Nuche)
[15:40:03] <wikibugs>	 (Merged) jenkins-bot: ParserOutputAccess: Allow calling getPO with option of not saving in PC [core] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/777388 (https://phabricator.wikimedia.org/T285993) (owner: Ladsgroup)
[15:40:41] <wikibugs>	 (Merged) jenkins-bot: test [mediawiki-config] (sandbox/jnuche) - https://gerrit.wikimedia.org/r/777373 (owner: Jaime Nuche)
[15:40:45] <moritzm>	 !log drain ganeti2019 T305469
[15:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:48] <stashbot>	 T305469: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469
[15:41:31] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[15:41:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:45] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1089.eqiad.wmnet
[15:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:04] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.5/includes: Backport: [[gerrit:777388|ParserOutputAccess: Allow calling getPO with option of not saving in PC (T285993)]] (duration: 01m 00s)
[15:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:07] <stashbot>	 T285993: [SPIKE] Estimate growth in demand for Parser Cache storage - https://phabricator.wikimedia.org/T285993
[15:43:10] <wikibugs>	 (PS1) Herron: sre.kafka.reboot-workers: add logging-codfw targets [cookbooks] - https://gerrit.wikimedia.org/r/777375 (https://phabricator.wikimedia.org/T279342)
[15:43:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1001.eqiad.wmnet with OS bullseye
[15:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage
[15:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:51] <Amir1>	 ugh, there might be a bit of errors
[15:44:00] <Amir1>	 that is me but should be recovered by now
[15:44:11] <Amir1>	 Cannot access private const
[15:44:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage
[15:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage
[15:44:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:53] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2024.codfw.wmnet with reason: Rebooting for T303174
[15:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:55] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2024.codfw.wmnet with reason: Rebooting for T303174
[15:45:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:12] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, see inline for commit message adjustment" [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans)
[15:46:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage
[15:46:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:44] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage
[15:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:42] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.remove-downtime for dbstore1003.eqiad.wmnet
[15:47:43] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbstore1003.eqiad.wmnet
[15:47:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:48:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:48:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:49:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:21] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Upgrade dbstore1005 to bullseye
[15:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:23] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Upgrade dbstore1005 to bullseye
[15:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:24] <wikibugs>	 (PS3) Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - https://gerrit.wikimedia.org/r/777353
[15:49:29] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage
[15:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:40] <wikibugs>	 (CR) Volans: "addressed comment" [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans)
[15:50:46] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[15:52:29] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1005.eqiad.wmnet with OS bullseye
[15:52:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:25] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[15:53:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! Thank you" [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans)
[15:55:32] <icinga-wm>	 PROBLEM - Host ml-serve1004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:56:15] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:57:30] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:58:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1002.eqiad.wmnet with OS bullseye
[15:58:26] <wikibugs>	 (PS4) Volans: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - https://gerrit.wikimedia.org/r/777353
[15:58:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:04] <icinga-wm>	 RECOVERY - Host ml-serve1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[15:59:25] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1004.eqiad.wmnet with OS bullseye
[15:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:00:05] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:15] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:01:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1090.eqiad.wmnet
[16:01:16] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis) New incidents and other posts on the status page will now automatically be...
[16:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1003.eqiad.wmnet with OS bullseye
[16:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:45] <wikibugs>	 (CR) JMeybohm: [C: +1] Update the chart to address issues with secrets and CI [deployment-charts] - https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[16:02:02] <wikibugs>	 (CR) Btullis: [C: +2] Update the chart to address issues with secrets and CI [deployment-charts] - https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[16:02:16] <wikibugs>	 (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381
[16:02:18] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy)
[16:02:20] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1005.eqiad.wmnet with reason: host reimage
[16:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy)
[16:03:46] <wikibugs>	 (PS2) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381
[16:03:48] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy)
[16:04:54] <wikibugs>	 (Merged) jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777381 (owner: Ahmon Dancy)
[16:05:13] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1005.eqiad.wmnet with reason: host reimage
[16:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:18] <wikibugs>	 (CR) Volans: [C: +2] sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans)
[16:06:27] <wikibugs>	 (Merged) jenkins-bot: Update the chart to address issues with secrets and CI [deployment-charts] - https://gerrit.wikimedia.org/r/777365 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[16:07:46] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:07:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:48] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:05] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:07] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:08:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:35] <wikibugs>	 (Merged) jenkins-bot: sre.cdn.roll-restart-varnish: use Thanos [cookbooks] - https://gerrit.wikimedia.org/r/777353 (owner: Volans)
[16:09:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1090.eqiad.wmnet
[16:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:15] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:17:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson) Moved all 3 servers to xe-0/0/28 on their respective switches, and committed the change on homer.
[16:17:50] <wikibugs>	 (PS1) Majavah: O:openstack: add new encapi roles [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247)
[16:18:45] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (RobH)
[16:18:47] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: (Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (RobH)
[16:18:56] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (RobH)
[16:19:03] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: (Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (RobH)
[16:19:30] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1005.eqiad.wmnet with OS bullseye
[16:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:08] <wikibugs>	 (PS2) Majavah: O:openstack: add new encapi roles [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247)
[16:20:46] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (RobH)
[16:20:53] <wikibugs>	 (PS1) Razzi: aqs: update mediawiki history snapshot for March 2022 [puppet] - https://gerrit.wikimedia.org/r/777407
[16:20:58] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34709/console" [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[16:21:09] <wikibugs>	 (CR) Majavah: O:openstack: add new encapi roles [puppet] - https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[16:23:13] <wikibugs>	 SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[16:27:14] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (Cmjohnson) Open→Resolved The disk has been replaced and is back online  cmjohnson@thanos-be1003:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0  Adapter 0: Created VD 12 Configured physical...
[16:27:26] <wikibugs>	 (PS2) Krinkle: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[16:28:24] <wikibugs>	 (CR) Krinkle: [C: +1] "LGTM. I updated the commit message as the current name wasn't a mistake. The directory is called wmf-config which seems fair to name as $w" [mediawiki-config] - https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[16:28:41] <icinga-wm>	 RECOVERY - MegaRAID on thanos-be1003 is OK: OK: optimal, 14 logical, 14 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:32:17] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.remove-downtime for dbstore1005.eqiad.wmnet
[16:32:17] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbstore1005.eqiad.wmnet
[16:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:54] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Upgrade dbstore1007 to bullseye
[16:32:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:57] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Upgrade dbstore1007 to bullseye
[16:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[16:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[16:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24123 and previous config saved to /var/cache/conftool/dbconfig/20220405-163454-ladsgroup.json
[16:34:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:59] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[16:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:58] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host dbstore1007.eqiad.wmnet with OS bullseye
[16:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:36] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] wmcs.backy2: fix typo in link to runbook for backup_vms [puppet] - https://gerrit.wikimedia.org/r/775961 (https://phabricator.wikimedia.org/T304408) (owner: Nskaggs)
[16:38:51] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[16:38:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:41:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:41:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:30] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:42:40] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[16:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:43:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:15] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:48:13] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage
[16:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:49:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:35] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[16:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:04] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage
[16:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:36] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[16:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:43] <icinga-wm>	 RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:54:09] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster
[16:54:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster
[16:58:06] <wikibugs>	 (PS1) Ahmon Dancy: train-dev fixups [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777410
[16:58:08] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] train-dev fixups [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777410 (owner: Ahmon Dancy)
[16:58:47] <wikibugs>	 (Merged) jenkins-bot: train-dev fixups [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/777410 (owner: Ahmon Dancy)
[16:59:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[16:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:04] <wikibugs>	 (PS1) MSantos: mobileapps: bump to 2022-04-04-120513-production [deployment-charts] - https://gerrit.wikimedia.org/r/777412
[17:02:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[17:02:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[17:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:04] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[17:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:35] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1046.eqiad.wmnet
[17:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:37] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1007.eqiad.wmnet with OS bullseye
[17:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:18] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.remove-downtime for dbstore1007.eqiad.wmnet
[17:06:18] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbstore1007.eqiad.wmnet
[17:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:01] <wikibugs>	 (CR) RLazarus: [C: +2] httpbb: Delete the git::clone and install via deb package [puppet] - https://gerrit.wikimedia.org/r/776977 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[17:08:19] <mutante>	 !log wtp1046 - rebooting
[17:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:38] <wikibugs>	 (PS1) Elukey: Update ml-serve-eqiad's dnscore pod IPs after cluster reinit [puppet] - https://gerrit.wikimedia.org/r/777413 (https://phabricator.wikimedia.org/T304673)
[17:09:37] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1045.eqiad.wmnet
[17:09:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:35] <wikibugs>	 (PS1) Elukey: Change ml-serve-eqiad coredns' pod IP after cluster reinit [deployment-charts] - https://gerrit.wikimedia.org/r/777414 (https://phabricator.wikimedia.org/T304673)
[17:10:52] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[17:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:23] <mutante>	 !log serially rebooting hosts in the wtp104* range
[17:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:40] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1046.eqiad.wmnet
[17:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:29] <icinga-wm>	 PROBLEM - Host wtp1045 is DOWN: PING CRITICAL - Packet loss = 100%
[17:14:37] <wikibugs>	 (PS1) RLazarus: httpbb: Restore params to absented systemd::timer::job [puppet] - https://gerrit.wikimedia.org/r/777415 (https://phabricator.wikimedia.org/T299705)
[17:14:44] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[17:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:03] <icinga-wm>	 RECOVERY - Host wtp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:16:07] <wikibugs>	 (CR) RLazarus: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34711/console" [puppet] - https://gerrit.wikimedia.org/r/777415 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[17:16:40] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1044.eqiad.wmnet
[17:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:58] <wikibugs>	 (CR) RLazarus: [V: +1 C: +2] httpbb: Restore params to absented systemd::timer::job [puppet] - https://gerrit.wikimedia.org/r/777415 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[17:17:16] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1045.eqiad.wmnet
[17:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:05] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[17:18:21] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1043.eqiad.wmnet
[17:18:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:25] <icinga-wm>	 PROBLEM - Host wtp1043 is DOWN: PING CRITICAL - Packet loss = 100%
[17:21:31] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[17:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:47] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1044.eqiad.wmnet
[17:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:27] <icinga-wm>	 RECOVERY - Host wtp1043 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[17:22:59] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4021.ulsfo.wmnet
[17:23:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:35] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1146.eqiad.wmnet with OS buster
[17:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster execut...
[17:23:43] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1042.eqiad.wmnet
[17:23:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:48] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1043.eqiad.wmnet
[17:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6001.drmrs.wmnet
[17:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:04] <wikibugs>	 (PS1) Eigyan: [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD [mediawiki-config] - https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962)
[17:25:26] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[17:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:05] <icinga-wm>	 PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:33] <icinga-wm>	 PROBLEM - Host wtp1042 is DOWN: PING CRITICAL - Packet loss = 100%
[17:27:17] <icinga-wm>	 ACKNOWLEDGEMENT - Host wtp1042 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot
[17:27:41] <icinga-wm>	 RECOVERY - Host wtp1042 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[17:28:26] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1041.eqiad.wmnet
[17:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:32] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1042.eqiad.wmnet
[17:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:25] <icinga-wm>	 RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:29:28] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4021.ulsfo.wmnet
[17:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4033.ulsfo.wmnet
[17:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6001.drmrs.wmnet
[17:31:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24124 and previous config saved to /var/cache/conftool/dbconfig/20220405-173143-ladsgroup.json
[17:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:47] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[17:31:51] <icinga-wm>	 PROBLEM - Host wtp1041 is DOWN: PING CRITICAL - Packet loss = 100%
[17:32:17] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1040.eqiad.wmnet
[17:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:49] <icinga-wm>	 RECOVERY - Host wtp1041 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[17:33:18] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1041.eqiad.wmnet
[17:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:58] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494)
[17:34:15] <wikibugs>	 (PS1) Btullis: Add the networkpoliy for the setups as a pre-install hook [deployment-charts] - https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454)
[17:34:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez)
[17:34:48] <wikibugs>	 (PS2) Btullis: Add the networkpolicy for the setups as a pre-install hook [deployment-charts] - https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454)
[17:35:26] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494)
[17:35:51] <icinga-wm>	 PROBLEM - Host wtp1040 is DOWN: PING CRITICAL - Packet loss = 100%
[17:36:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez)
[17:36:09] <icinga-wm>	 RECOVERY - Host wtp1040 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[17:36:40] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494)
[17:36:59] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1040.eqiad.wmnet
[17:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4033.ulsfo.wmnet
[17:37:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:40] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494)
[17:40:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4035.ulsfo.wmnet
[17:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6002.drmrs.wmnet
[17:40:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:00] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34712/" [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez)
[17:41:44] <wikibugs>	 (CR) Majavah: cloudgw: relocate dataplane-specific sysctl params to ifupdown (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez)
[17:45:34] <wikibugs>	 (CR) MSantos: [C: +2] mobileapps: bump to 2022-04-04-120513-production [deployment-charts] - https://gerrit.wikimedia.org/r/777412 (owner: MSantos)
[17:45:43] <wikibugs>	 (PS1) Herron: spicerack: add logging clusters to elasticsearch config [puppet] - https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864)
[17:46:25] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494)
[17:46:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24125 and previous config saved to /var/cache/conftool/dbconfig/20220405-174648-ladsgroup.json
[17:46:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:09] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cloudgw: relocate dataplane-specific sysctl params to ifupdown (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez)
[17:48:06] <wikibugs>	 (CR) Herron: "something to get the ball rolling, AFAICT we'll need to decide if we want to open up ferm access direct from the cumin hosts, or use a pro" [puppet] - https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864) (owner: Herron)
[17:48:23] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6002.drmrs.wmnet
[17:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4035.ulsfo.wmnet
[17:48:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34713/" [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez)
[17:48:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:43] <wikibugs>	 (PS2) Herron: spicerack: add logging clusters to elasticsearch config [puppet] - https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864)
[17:49:29] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:49:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6003.drmrs.wmnet
[17:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:49] <wikibugs>	 (Merged) jenkins-bot: mobileapps: bump to 2022-04-04-120513-production [deployment-charts] - https://gerrit.wikimedia.org/r/777412 (owner: MSantos)
[17:51:18] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2020.wmnet
[17:51:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudgw: relocate dataplane-specific sysctl params to ifupdown [puppet] - https://gerrit.wikimedia.org/r/777418 (https://phabricator.wikimedia.org/T305494) (owner: Arturo Borrero Gonzalez)
[17:51:58] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse201[7-9].wmnet
[17:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:25] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse201[7-9].codfw.wmnet
[17:52:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:14] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet
[17:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:42] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host parse2020.codfw.wmnet
[17:54:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:15] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:56:14] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[17:56:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:43] <wikibugs>	 SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[17:56:59] <wikibugs>	 (PS5) Herron: ipmiseld: ensure service enabled and running [puppet] - https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147)
[17:57:21] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47967 bytes in 3.681 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:57:33] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2001-dev.codfw.wmnet
[17:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:02] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6003.drmrs.wmnet
[17:58:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:10] <mutante>	 !log rebooting hosts in the parse201* range, starting with parse2019, counting down
[17:58:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:26] <wikibugs>	 (CR) Herron: ipmiseld: ensure service enabled and running (1 comment) [puppet] - https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: Herron)
[17:58:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (cmooney) ^^ Above reimage seemed to fail due to some disk problem, I suspect maybe the raid config needs to be done in the BIOS (I was running...
[17:58:57] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2020.codfw.wmnet
[17:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:38] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet
[17:59:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:48] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parse2020.codfw.wmnet
[17:59:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:26] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[18:00:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:33] <icinga-wm>	 PROBLEM - Host parse2019 is DOWN: PING CRITICAL - Packet loss = 100%
[18:01:29] <icinga-wm>	 RECOVERY - Host parse2019 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[18:01:39] <icinga-wm>	 PROBLEM - Host parse2018 is DOWN: PING CRITICAL - Packet loss = 100%
[18:01:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24126 and previous config saved to /var/cache/conftool/dbconfig/20220405-180153-ladsgroup.json
[18:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:23] <icinga-wm>	 RECOVERY - Host parse2018 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms
[18:05:01] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2001-dev.codfw.wmnet
[18:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:12] <wikibugs>	 SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[18:07:09] <icinga-wm>	 PROBLEM - Host parse2017 is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6004.drmrs.wmnet
[18:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:37] <icinga-wm>	 RECOVERY - Host parse2017 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms
[18:12:24] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:15:52] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6004.drmrs.wmnet
[18:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24127 and previous config saved to /var/cache/conftool/dbconfig/20220405-181658-ladsgroup.json
[18:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:03] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[18:17:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[18:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[18:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24128 and previous config saved to /var/cache/conftool/dbconfig/20220405-181712-ladsgroup.json
[18:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:52] <wikibugs>	 (PS1) RLazarus: Make the package name consistent [software/httpbb] - https://gerrit.wikimedia.org/r/777422
[18:18:28] <wikibugs>	 (PS2) RLazarus: Make the package name consistent [software/httpbb] - https://gerrit.wikimedia.org/r/777422 (https://phabricator.wikimedia.org/T299705)
[18:19:50] <wikibugs>	 (CR) RLazarus: [C: +2] Make the package name consistent [software/httpbb] - https://gerrit.wikimedia.org/r/777422 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[18:20:53] <wikibugs>	 SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) @Marostegui or anyone in the DB team I am planning on moving all the db nodes in Rack B1 to Rack B5 but please see detail in the table in the description. I have db2109 already in...
[18:21:03] <wikibugs>	 (Merged) jenkins-bot: Make the package name consistent [software/httpbb] - https://gerrit.wikimedia.org/r/777422 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[18:22:09] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2016.codfw.wmnet
[18:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:36] <wikibugs>	 (PS1) JHathaway: mx: reject email to legacy mailing list domains [puppet] - https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472)
[18:22:58] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2020.codfw.wmnet
[18:22:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:28] <icinga-wm>	 PROBLEM - Host parse2016 is DOWN: PING CRITICAL - Packet loss = 100%
[18:23:34] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=parse2015.codfw.wmnet
[18:23:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:15] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2019.codfw.wmnet
[18:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6005.drmrs.wmnet
[18:24:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:20] <icinga-wm>	 RECOVERY - Host parse2016 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms
[18:24:36] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2018.codfw.wmnet
[18:24:37] <wikibugs>	 (PS2) JHathaway: mx: reject email to legacy mailing list domains [puppet] - https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472)
[18:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:18] <icinga-wm>	 PROBLEM - Host parse2015 is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:28] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2017.codfw.wmnet
[18:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:50] <icinga-wm>	 RECOVERY - Host parse2015 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[18:28:34] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2016.codfw.wmnet
[18:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:39] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=parse2015.codfw.wmnet
[18:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:45] <rzl>	 !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/httpbb/buster/httpbb_0.0.1-1_amd64.changes  # T299705
[18:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:49] <stashbot>	 T299705: Debian package for httpbb - https://phabricator.wikimedia.org/T299705
[18:29:06] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: JHathaway)
[18:30:03] <wikibugs>	 SRE, Infrastructure-Foundations, observability, Patch-For-Review, User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (herron) >>! In T305147#7824394, @MoritzMuehlenhoff wrote: > There's some things which are still puzzling here: Why wasn't this n...
[18:31:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6005.drmrs.wmnet
[18:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:28] <rzl>	 !log rzl@apt1001:~$ sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/bullseye/httpbb_0.0.1-1+deb11u1_amd64.changes
[18:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:21] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet
[18:37:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[18:41:33] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet
[18:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:35] <wikibugs>	 SRE, User-Marostegui, User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (Umherirrender)
[18:41:51] <wikibugs>	 SRE: prometheus: ganglia-gen and outdated Ganglia:cluster resource name - https://phabricator.wikimedia.org/T186918 (Umherirrender)
[18:42:11] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog1002.eqiad.wmnet
[18:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:22] <wikibugs>	 SRE, Phabricator: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644 (Umherirrender)
[18:42:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6006.drmrs.wmnet
[18:42:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:17] <wikibugs>	 SRE: operations/software repo: flake8 check - https://phabricator.wikimedia.org/T178877 (Umherirrender) Open→Resolved
[18:43:19] <wikibugs>	 SRE, User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687 (Umherirrender)
[18:43:35] <wikibugs>	 SRE: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543 (Umherirrender)
[18:43:57] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab: fix duplicate backup_dir hiera key [puppet] - https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[18:43:57] <wikibugs>	 SRE, Wikimedia-Apache-configuration: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176 (Umherirrender)
[18:44:22] <wikibugs>	 SRE, IRCecho, Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875 (Umherirrender)
[18:44:58] <wikibugs>	 SRE, Deployments: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585 (Umherirrender)
[18:45:19] <wikibugs>	 SRE: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (Umherirrender)
[18:45:52] <icinga-wm>	 PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: excimer-k8s-log.service,excimer-k8s-wall-log.service,excimer-log.service,excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:54] <wikibugs>	 SRE: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671 (Umherirrender)
[18:46:42] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: fix duplicate backup_dir hiera key [puppet] - https://gerrit.wikimedia.org/r/777359 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[18:46:54] <icinga-wm>	 PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: excimer-k8s-log.service,excimer-k8s-wall-log.service,excimer-log.service,excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:47:32] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet
[18:47:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:48] <icinga-wm>	 RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:12] <wikibugs>	 SRE, cloud-services-team (Kanban): Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251 (Umherirrender)
[18:48:26] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:49:39] <wikibugs>	 (CR) Scardenasmolinar: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962) (owner: Eigyan)
[18:50:24] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:50:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6006.drmrs.wmnet
[18:50:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:51:56] <rzl>	 👋
[18:52:06] <jhathaway>	 👋
[18:52:11] <herron>	 hey
[18:52:15] <cdanis>	 this is a new alert, right?
[18:52:22] <godog>	 here too
[18:52:24] <mutante>	 I thought it was my IRC shell box
[18:52:26] <mutante>	 apparently not
[18:52:32] * volans here
[18:52:32] <godog>	 cdanis: correct
[18:52:33] <rzl>	 cdanis: it is yeah
[18:52:37] <cdanis>	 is it expected that the only alert I see on https://alerts.wikimedia.org/?q=alertname%3DProbeDown is for a different service entirely?
[18:52:40] <volans>	 yes, it's the new one, might be a false alarm
[18:53:15] <cdanis>	 oh
[18:53:16] <volans>	 cdanis: I see it
[18:53:19] <cdanis>	 apparently I was holding it wrong?
[18:53:28] <cdanis>	 it only showed service inference:30443 until I pressed a red button
[18:53:41] <rzl>	 it didn't show up the first time I clicked and then it did show up the second time? might be inconsistent
[18:53:56] <volans>	 target=https://[10.2.2.5]:443/w/health-check.php msg="Error for HTTP request" err="Get \"https://10.2.2.5:443/w/health-check.php\": context deadline exceeded"
[18:53:58] <godog>	 yeah it should show up
[18:54:18] <volans>	 godog: is the timeout set at 2.5s?
[18:54:19] <godog>	 it does for me now fwiw, I see from the dashboard that videoscaler is flapping
[18:54:23] <volans>	 msg="Probe failed" duration_seconds=2.500589415
[18:54:34] <godog>	 volans: IIRC yes
[18:55:47] <godog>	 so yeah looks like videoscaler is slow according to the probe anyways, not sure we're in trouble tho
[18:56:00] <jhathaway>	 some seem to take as long as 7 secs in my brief testing with curl
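The probe output volans quotes shows the failure mechanism: the request is cut off at the deadline (duration_seconds just over 2.5), so a health check that would have answered after 7 seconds is recorded as a failure even though the backend is alive. A minimal Python sketch of that behaviour, assuming the 2.5 s deadline mentioned in the discussion; `probe_outcome` and its return shape are hypothetical and not part of the probing software.

```python
from typing import Tuple

PROBE_TIMEOUT_SECONDS = 2.5  # assumed deadline, taken from the discussion above


def probe_outcome(response_seconds: float,
                  timeout_seconds: float = PROBE_TIMEOUT_SECONDS) -> Tuple[bool, float]:
    """Return (success, observed_duration) for one simulated probe.

    A response slower than the deadline is abandoned at the deadline, so the
    probe reports failure and a duration equal to the timeout.
    """
    if response_seconds > timeout_seconds:
        return False, timeout_seconds
    return True, response_seconds


if __name__ == "__main__":
    # Fast, borderline, and slow-but-alive responses.
    for seconds in (0.4, 2.4, 7.0):
        print(seconds, probe_outcome(seconds))
```

Under this model a 7-second response and a dead backend look identical to the probe, which is one reason a tight deadline can make an alert more sensitive than older checks with longer timeouts.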
[18:56:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:56:28] <mutante>	 it switched to zotero?
[18:56:43] <volans>	 [question] those new alerts, were they live without the page bit before? or were we just collecting the data?
[18:57:01] <godog>	 volans: they were live, though as warnings
[18:57:15] <wikibugs>	 SRE: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940 (Umherirrender)
[18:57:16] <volans>	 do we have any stats on how much they were/will be firing?
[18:57:50] <volans>	 because they seem (at least so far) much more sensitive than the existing ones AFAICT
[18:57:51] <godog>	 yes we have the logs, though looking back in the dashboards will also tell
[18:58:25] <godog>	 agreed, I'll revert the paging change for now and ask questions later
[18:59:24] <godog>	 "revert" actually I'll switch to critical from paging
[18:59:30] <wikibugs>	 SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Ladsgroup) FWIW: |db2076|sanitarium master for s6 |db2086|core multiinstance: s7, s8 |db2107|candidate master for s2 |db2137|core multiinstance: s4, s5 |db2143|x2 replica |db2147|s4 replica  A...
[18:59:31] <volans>	 SGTM
[18:59:46] <wikibugs>	 (PS1) Filippo Giunchedi: sre: move network probes to critical from page [alerts] - https://gerrit.wikimedia.org/r/777429
[18:59:48] <rzl>	 +1 thanks godog 
[18:59:57] <volans>	 just ignore the page bit for now should be enough to get a generic sense of how they go for now
[19:00:06] <volans>	 *ignoring
[19:00:26] <godog>	 agreed, sorry folks, it didn't begin as smoothly as I planned!
[19:00:36] <wikibugs>	 (CR) Volans: [C: +1] "LGTM (logically, I'm not that familiar with the current abstraction)" [alerts] - https://gerrit.wikimedia.org/r/777429 (owner: Filippo Giunchedi)
[19:01:35] <wikibugs>	 (PS6) Bking: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570)
[19:02:23] <jhathaway>	 godog: no problem, there are always bumps on the road to making things better!
[19:02:49] <godog>	 heheh indeed jhathaway, thank you
[19:03:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Jclark-ctr)
[19:03:46] <wikibugs>	 (CR) Herron: [C: +1] sre: move network probes to critical from page [alerts] - https://gerrit.wikimedia.org/r/777429 (owner: Filippo Giunchedi)
[19:03:53] <wikibugs>	 (PS2) Filippo Giunchedi: sre: move network probes to critical from page [alerts] - https://gerrit.wikimedia.org/r/777429
[19:04:04] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] sre: move network probes to critical from page [alerts] - https://gerrit.wikimedia.org/r/777429 (owner: Filippo Giunchedi)
[19:04:59] <godog>	 basically this https://i.redd.it/uziifr83woo81.jpg
[19:05:27] <wikibugs>	 (PS7) Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[19:05:32] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[19:06:08] <wikibugs>	 (PS1) RLazarus: httpbb: Add force => true to properly delete the old $install_dir [puppet] - https://gerrit.wikimedia.org/r/777430 (https://phabricator.wikimedia.org/T299705)
[19:06:52] <godog>	 waiting a few seconds for the change to be rolled out to the prometheus hosts then going afk
[19:07:36] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.360 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[19:08:59] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6009.drmrs.wmnet
[19:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:05] <wikibugs>	 (CR) RLazarus: [C: +2] httpbb: Add force => true to properly delete the old $install_dir [puppet] - https://gerrit.wikimedia.org/r/777430 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[19:09:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (wiki_willy)
[19:09:54] <icinga-wm>	 RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[19:16:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6009.drmrs.wmnet
[19:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:17:24] <wikibugs>	 SRE: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100 (Umherirrender)
[19:22:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:22:51] <XioNoX>	 godog: can you remove the #  page?
[19:24:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:24:37] <rzl>	 XioNoX: I think I see where to remove it, giving it a try
[19:24:55] <XioNoX>	 rzl: thanks happy to review the change
[19:26:25] <wikibugs>	 (PS1) RLazarus: sre: Remove paging hotword from network probes [alerts] - https://gerrit.wikimedia.org/r/777432
[19:28:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24129 and previous config saved to /var/cache/conftool/dbconfig/20220405-192800-ladsgroup.json
[19:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:05] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[19:29:03] <rzl>	 XioNoX: ready, https://gerrit.wikimedia.org/r/777432
[19:29:12] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6010.drmrs.wmnet
[19:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:29:47] <XioNoX>	 rzl: lgtm!
[19:29:49] <wikibugs>	 (CR) Ayounsi: [C: +1] sre: Remove paging hotword from network probes [alerts] - https://gerrit.wikimedia.org/r/777432 (owner: RLazarus)
[19:30:30] <wikibugs>	 (CR) RLazarus: [C: +2] sre: Remove paging hotword from network probes [alerts] - https://gerrit.wikimedia.org/r/777432 (owner: RLazarus)
[19:32:39] <wikibugs>	 (Merged) jenkins-bot: sre: Remove paging hotword from network probes [alerts] - https://gerrit.wikimedia.org/r/777432 (owner: RLazarus)
[19:33:14] <rzl>	 manually running puppet on P:alerts::deploy::prometheus to avoid bothering folks over the next 30 min
[19:35:03] <volans>	 thanks rzl
[19:35:16] <wikibugs>	 (CR) Herron: "Looks good to me overall, will be great to shed these legacy addresses.  Please see question inline, thanks!" [puppet] - https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: JHathaway)
[19:36:36] <rzl>	 done 🤞
[19:36:38] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6010.drmrs.wmnet
[19:36:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:57] <wikibugs>	 (PS1) Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673)
[19:38:01] <wikibugs>	 (PS1) Zabe: postgresql: remove absented backup crons [puppet] - https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673)
[19:38:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[19:38:30] <wikibugs>	 (CR) JHathaway: mx: reject email to legacy mailing list domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777423 (https://phabricator.wikimedia.org/T280472) (owner: JHathaway)
[19:38:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] postgresql: remove absented backup crons [puppet] - https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[19:41:01] <wikibugs>	 (PS2) Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673)
[19:42:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:43:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24130 and previous config saved to /var/cache/conftool/dbconfig/20220405-194305-ladsgroup.json
[19:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:18] <rzl>	 ^ dropping the hotword worked, at least
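A hedged sketch of what "dropping the hotword" changes, using shortened copies of the two ProbeDown announcements from this log. It only checks whether the literal `#page` token is present in the announcement text; what consumes that token downstream is not stated here, and `has_paging_hotword` is a hypothetical helper.

```python
PAGING_HOTWORD = "#page"  # token seen in the earlier ProbeDown announcements


def has_paging_hotword(announcement: str, hotword: str = PAGING_HOTWORD) -> bool:
    """True if the hotword appears as a standalone whitespace-separated token."""
    return hotword in announcement.split()


# Announcement text before and after the hotword was removed (URLs shortened).
before = ("(ProbeDown) firing: Service videoscaler:443 has failed probes "
          "(http_videoscaler_ip4) #page - "
          "https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown")
after = ("(ProbeDown) firing: Service videoscaler:443 has failed probes "
         "(http_videoscaler_ip4) - "
         "https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown")

if __name__ == "__main__":
    print(has_paging_hotword(before), has_paging_hotword(after))
```

Matching on whole tokens keeps the `#ProbeDown` URL fragment from counting as a hit.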
[19:46:07] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6011.drmrs.wmnet
[19:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:49:48] <logmsgbot>	 !log rzl@cumin2002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw(1307|1308|1309|1310|1311|1318|1334|1335|1336|1337).*
[19:49:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:14] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6011.drmrs.wmnet
[19:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:17] <wikibugs>	 (PS3) Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673)
[19:56:56] <wikibugs>	 (PS4) Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673)
[19:57:32] <duesen>	 I'm trying to figure out how to make the deployment server kick me out after 15 minutes of inactivity. But TMOUT is set readonly, to a much higher value... is there a way to lower it?
[19:58:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24131 and previous config saved to /var/cache/conftool/dbconfig/20220405-195810-ladsgroup.json
[19:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:24] <bd808>	 duesen: if you find a way to lower it, let me know so I can use the same trick to raise it for my user. ;)
[20:00:05] <jouncebot>	 RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220405T2000).
[20:00:05] <jouncebot>	 eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <eigyan>	 greetings all
[20:00:20] <duesen>	 bd808: raising it is easy: exec env TMOUT=0 bash
[20:00:56] <RoanKattouw>	 Hi eigyan ! I'll be ready to do the deployment in about 15 minutes, apologies for the delay
[20:01:02] <wikibugs>	 (CR) Zabe: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/34715/" [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[20:01:08] <eigyan>	 take your time RoanKattouw
[20:01:12] <eigyan>	 :)
[20:01:17] <RoanKattouw>	 It's lunchtime and I need to eat :)
[20:02:08] <duesen>	 ...so, considering that it's actually easy to raise the TMOUT, can we just not make it readonly, so I can also lower it?...
[20:02:49] <duesen>	 oh wait, actually, does this work? exec env TMOUT=300 bash && exit
[20:02:53] <duesen>	 let me try that
[20:04:34] <zabe>	 I added two patches to the window, hope that's ok
[20:05:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6012.drmrs.wmnet
[20:05:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:16] <hauskatze>	 zabe: or there may be mices around :P
[20:05:16] <wikibugs>	 SRE, Data-Engineering, Product-Analytics, wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (JArguello-WMF)
[20:05:39] <duesen>	 heh, that does work. nice.
[20:05:52] <zabe>	 :p
[20:06:14] <duesen>	 I'll just put that into my .profile, then :P
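The trick relies on `env` placing TMOUT in the environment of a fresh bash, which then starts with that value already set. A small Python check of that mechanism, assuming `bash` is on PATH; `child_tmout` is a hypothetical helper. Where the deployment server marks TMOUT readonly is not stated in the log, so that part is not modelled here. Because `exec` replaces the original shell, the trailing `&& exit` should never run when the exec succeeds.

```python
import os
import subprocess


def child_tmout(value: str) -> str:
    """Start a non-login, non-interactive bash with TMOUT preset and report what it sees."""
    env = dict(os.environ, TMOUT=value)
    # Avoid sourcing any startup file that BASH_ENV might point at.
    env.pop("BASH_ENV", None)
    result = subprocess.run(
        ["bash", "-c", 'echo "$TMOUT"'],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Mirrors `env TMOUT=300 bash`: the child shell inherits the new value.
    print(child_tmout("300"))
```

The same inheritance explains why `exec env TMOUT=0 bash` raises the limit: in bash a TMOUT of 0 disables the idle timeout.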
[20:13:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24132 and previous config saved to /var/cache/conftool/dbconfig/20220405-201315-ladsgroup.json
[20:13:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[20:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[20:13:19] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[20:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:23] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6012.drmrs.wmnet
[20:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:37] * urbanecm waves
[20:18:44] <urbanecm>	 eigyan: RoanKattouw: did deployment already start?
[20:18:50] <urbanecm>	 (if not, i can do it)
[20:18:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson) @cmooney Can you confirm the raid setup please. analytics-flex is first 2 ssds are raid 1 and the rest jbod?
[20:23:37] <eigyan>	 @urba
[20:23:41] <wikibugs>	 (CR) Dzahn: postgresql: migrate backup crons to systemd timer jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[20:23:54] <eigyan>	 urbanecm I don't think so
[20:24:04] <urbanecm>	 in that case, let's do it
[20:24:06] <urbanecm>	 hello eigyan 
[20:24:10] <urbanecm>	 sorry for the lateness
[20:24:32] <urbanecm>	 and hello zabe! just saw you have a patch too
[20:24:33] <eigyan>	 Greetings urbanecm, RoanKattouw was just having some lunch
[20:24:43] <urbanecm>	 yep, saw that in the scrollback
[20:24:49] <urbanecm>	 I'll deploy today
[20:24:59] <zabe>	 hey
[20:25:01] <eigyan>	 urbanecm perfect...let's rock!
[20:25:11] <wikibugs>	 (PS1) RLazarus: httpbb: Clean up absented objects [puppet] - https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705)
[20:25:23] <wikibugs>	 (CR) Urbanecm: [C: +2] [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD [mediawiki-config] - https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962) (owner: Eigyan)
[20:26:06] <wikibugs>	 (Merged) jenkins-bot: [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD [mediawiki-config] - https://gerrit.wikimedia.org/r/777416 (https://phabricator.wikimedia.org/T303962) (owner: Eigyan)
[20:26:18] <wikibugs>	 (PS2) RLazarus: httpbb: Clean up absented objects [puppet] - https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705)
[20:27:11] <wikibugs>	 (CR) Dzahn: [C: +1] httpbb: Clean up absented objects [puppet] - https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[20:27:27] <urbanecm>	 eigyan: i pulled it to mwdebug1001. can you test please?
[20:27:46] <wikibugs>	 (CR) Dzahn: [C: +1] "assuming puppet already ran on everything that could have this" [puppet] - https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[20:27:46] <eigyan>	 urbanecm testing now thanks!
[20:27:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6013.drmrs.wmnet
[20:27:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:56] <RoanKattouw>	 Thanks urbanecm ! I just got back from lunch but I see you've got it
[20:28:20] <urbanecm>	 no problem RoanKattouw. i hope the lunch was good :)
[20:30:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:30:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:42] <wikibugs>	 (CR) Dzahn: [C: +1] "really checked with 'sudo cumin 'C:profile::httpbb' 'file /srv/deployment/httpbb'' 4 hosts and none have the dir :)" [puppet] - https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[20:31:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:33] <eigyan>	 urbanecm patch is working as expected. Thanks!
[20:32:38] <urbanecm>	 syncing
[20:33:35] <wikibugs>	 (PS3) Urbanecm: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:33:40] <wikibugs>	 (CR) Urbanecm: [C: +2] tests: rename $wmfConfigDir to $configDir [mediawiki-config] - https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:33:58] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 10c16c5ed46014ec6f5e771f84320441974bef6c: [config]: Undeploy GDI survey from EN,FR and ES wikis in PROD (T303962) (duration: 00m 55s)
[20:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:01] <stashbot>	 T303962: Undeploy Safety Survey for EN, ES, FR wikis FROM PRODUCTION - https://phabricator.wikimedia.org/T303962
[20:34:04] <urbanecm>	 eigyan: and live
[20:34:06] <urbanecm>	 anything else?
[20:34:21] <wikibugs>	 (Merged) jenkins-bot: tests: rename $wmfConfigDir to $configDir [mediawiki-config] - https://gerrit.wikimedia.org/r/776253 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:34:25] <eigyan>	 urbanecm I am all good here, thank you so much!
[20:34:32] <urbanecm>	 happy to help!
[20:34:41] <wikibugs>	 (CR) RLazarus: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/777442 (https://phabricator.wikimedia.org/T299705) (owner: RLazarus)
[20:34:44] <urbanecm>	 zabe: first patch only changes tests, so no testing etc. needed
[20:34:47] <urbanecm>	 looking at the second one
[20:35:12] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6013.drmrs.wmnet
[20:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:37] <wikibugs>	 SRE, serviceops, Patch-For-Review: Debian package for httpbb - https://phabricator.wikimedia.org/T299705 (RLazarus) Open→Resolved
[20:35:42] <wikibugs>	 SRE, Wikimedia-Apache-configuration, serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (RLazarus)
[20:37:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:06] <wikibugs>	 (PS1) Jdlrobson: Update to 78eef14,  rename viewportSize to viewportSizeBucket [extensions/WikimediaEvents] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777389 (https://phabricator.wikimedia.org/T301391)
[20:38:08] <wikibugs>	 (PS2) Urbanecm: Change upload dialog automatic upload comments [mediawiki-config] - https://gerrit.wikimedia.org/r/776323 (https://phabricator.wikimedia.org/T305303) (owner: Zabe)
[20:38:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:16] <wikibugs>	 (CR) Urbanecm: [C: +2] Change upload dialog automatic upload comments [mediawiki-config] - https://gerrit.wikimedia.org/r/776323 (https://phabricator.wikimedia.org/T305303) (owner: Zabe)
[20:38:41] <wikibugs>	 (CR) Jdlrobson: "Jan: Any chance you would be able to backport this tomorrow? We were hoping to go into next week with some data to help make a decision ar" [extensions/WikimediaEvents] (wmf/1.39.0-wmf.6) - https://gerrit.wikimedia.org/r/777389 (https://phabricator.wikimedia.org/T301391) (owner: Jdlrobson)
[20:39:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:39:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:31] <wikibugs>	 (Merged) jenkins-bot: Change upload dialog automatic upload comments [mediawiki-config] - https://gerrit.wikimedia.org/r/776323 (https://phabricator.wikimedia.org/T305303) (owner: Zabe)
[20:40:31] <urbanecm>	 zabe: your patch is at mwdebug1001
[20:40:33] <urbanecm>	 please test
[20:41:00] <zabe>	 doing
[20:41:12] <razzi>	 !log deploying refinery for https://gerrit.wikimedia.org/r/c/analytics/refinery/+/776269/
[20:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:12] <zabe>	 urbanecm, lgtm: https://commons.wikimedia.org/w/index.php?title=File:Hi-776323.png&action=history
[20:45:24] <urbanecm>	 looks good too
[20:45:26] <urbanecm>	 syncing
[20:47:19] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 8ea86349017e71dcd38bde0663cfb13e86fe127c: Change upload dialog automatic upload comments (T305303) (duration: 00m 54s)
[20:47:21] <urbanecm>	 zabe: it's live
[20:47:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:23] <urbanecm>	 anything else?
[20:47:25] <stashbot>	 T305303: Change upload dialog edit summary on Commons - https://phabricator.wikimedia.org/T305303
[20:47:29] <mutante>	 !log puppetmaster1001 - running test downloads of geoip databases to a temp dir
[20:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:47] <zabe>	 no, thx :)
[20:48:05] <urbanecm>	 okay :)
[20:48:08] <urbanecm>	 then we're done
[20:48:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:13] <urbanecm>	 !log UTC late B&C window done
[20:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:17] <logmsgbot>	 !log razzi@deploy1002 Started deploy [analytics/refinery@fd8b410]: Regular analytics weekly train [analytics/refinery@fd8b410]
[20:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:22] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp6014.drmrs.wmnet
[20:53:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[20:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[20:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[20:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[20:58:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24133 and previous config saved to /var/cache/conftool/dbconfig/20220405-205822-ladsgroup.json
[20:58:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:25] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[21:02:37] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6014.drmrs.wmnet
[21:02:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:07] <logmsgbot>	 !log razzi@deploy1002 Finished deploy [analytics/refinery@fd8b410]: Regular analytics weekly train [analytics/refinery@fd8b410] (duration: 22m 50s)
[21:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:12] <logmsgbot>	 !log razzi@deploy1002 Started deploy [analytics/refinery@fd8b410] (thin): Regular analytics weekly train THIN [analytics/refinery@fd8b410]
[21:14:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:23] <logmsgbot>	 !log razzi@deploy1002 Finished deploy [analytics/refinery@fd8b410] (thin): Regular analytics weekly train THIN [analytics/refinery@fd8b410] (duration: 00m 10s)
[21:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:25] <logmsgbot>	 !log razzi@deploy1002 Started deploy [analytics/refinery@fd8b410] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fd8b410]
[21:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:01] <wikibugs>	 (CR) Dzahn: [C: +2] puppetmaster:geoip: stop trying to download GeoIP1 legacy databases [puppet] - https://gerrit.wikimedia.org/r/773843 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn)
[21:20:23] <wikibugs>	 (CR) Dzahn: [C: +2] "tested with a manual update run, no files are being removed by this" [puppet] - https://gerrit.wikimedia.org/r/773843 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn)
[21:21:13] <logmsgbot>	 !log razzi@deploy1002 Finished deploy [analytics/refinery@fd8b410] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fd8b410] (duration: 06m 48s)
[21:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:03] <wikibugs>	 (PS1) Zabe: extdist: change response code from 302 to 301 [puppet] - https://gerrit.wikimedia.org/r/777446
[21:26:12] <wikibugs>	 (CR) Krinkle: [C: +1] "Good to go!" [mediawiki-config] - https://gerrit.wikimedia.org/r/776254 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[21:26:56] <wikibugs>	 (PS2) Zabe: extdist: change response code from 302 to 301 [puppet] - https://gerrit.wikimedia.org/r/777446
[21:41:53] <wikibugs>	 (PS8) Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[21:45:09] <wikibugs>	 SRE, Data-Engineering, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (Dzahn)
[21:50:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: don't wait for green on first node [software/spicerack] - https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: Bking)
[21:54:59] <wikibugs>	 SRE, Data-Engineering, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (Dzahn) > Modify the puppet code to no longer download the databases from MaxMind and then propagate to other servers/destinations.   This is done.  puppet c...
[21:57:33] <wikibugs>	 SRE, Data-Engineering, Traffic, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) a:Dzahn→None
[21:58:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24135 and previous config saved to /var/cache/conftool/dbconfig/20220405-215837-ladsgroup.json
[21:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:41] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[21:58:47] <wikibugs>	 (PS5) Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673)
[22:01:46] <wikibugs>	 SRE: prometheus: ganglia-gen and outdated Ganglia:cluster resource name - https://phabricator.wikimedia.org/T186918 (Dzahn) The file mentioned was removed in T253555  / https://gerrit.wikimedia.org/r/c/operations/puppet/+/609131
[22:02:54] <wikibugs>	 SRE, Analytics, Traffic, Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (Dzahn)
[22:03:00] <wikibugs>	 SRE: prometheus: ganglia-gen and outdated Ganglia:cluster resource name - https://phabricator.wikimedia.org/T186918 (Dzahn)
[22:03:10] <icinga-wm>	 PROBLEM - SSH on wtp1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:03:39] <wikibugs>	 (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34717/" [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[22:03:40] <mutante>	 zabe: thank you! please give it .sh file extension. That would add CI checks for shell scripts
[22:04:03] <zabe>	 ok
[22:06:30] <wikibugs>	 (PS6) Zabe: postgresql: migrate backup crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673)
[22:07:18] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn)
[22:08:31] <zabe>	 mutante, done
[22:08:33] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn) Open→Stalled ACK, I am setting this to stalled until May.
[22:09:43] <wikibugs>	 (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34718/" [puppet] - https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[22:12:25] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:13:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24136 and previous config saved to /var/cache/conftool/dbconfig/20220405-221342-ladsgroup.json
[22:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:00] <wikibugs>	 (PS2) Zabe: postgresql: remove absented backup crons [puppet] - https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673)
[22:14:40] <mutante>	 zabe: ack, thanks!
[22:18:10] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn)
[22:28:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24137 and previous config saved to /var/cache/conftool/dbconfig/20220405-222847-ladsgroup.json
[22:28:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:14] <wikibugs>	 (PS1) Zabe: zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673)
[22:29:16] <wikibugs>	 (PS1) Zabe: zookeeper: remove absented zookeeper-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673)
[22:29:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[22:31:01] <wikibugs>	 (PS2) Zabe: zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673)
[22:37:46] <jinxer-wm>	 (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC
[22:43:10] <wikibugs>	 (PS1) Zabe: toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673)
[22:43:12] <wikibugs>	 (PS1) Zabe: toil: remove absented systemd_scope_cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673)
[22:43:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[22:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24138 and previous config saved to /var/cache/conftool/dbconfig/20220405-224352-ladsgroup.json
[22:43:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[22:43:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[22:43:58] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[22:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] toil: remove absented systemd_scope_cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[22:44:48] <wikibugs>	 (PS2) Zabe: toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673)
[22:45:39] <wikibugs>	 (PS2) Zabe: toil: remove absented systemd_scope_cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673)
[22:52:05] <wikibugs>	 (PS1) Zabe: cinderutils: remove absented file [puppet] - https://gerrit.wikimedia.org/r/777456
[22:52:54] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:15:42] <wikibugs>	 (CR) Zabe: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34719/" [puppet] - https://gerrit.wikimedia.org/r/761718 (owner: Zabe)
[23:28:17] <wikibugs>	 (PS2) Reedy: Use namespaced GerritExtDistProvider [mediawiki-config] - https://gerrit.wikimedia.org/r/774963
[23:30:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[23:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[23:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24139 and previous config saved to /var/cache/conftool/dbconfig/20220405-233042-ladsgroup.json
[23:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:47] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[23:31:14] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[23:33:20] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[23:54:00] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook