[00:01:20] <icinga-wm>	 RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:32] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:03:35] <wikibugs_>	 (CR) Dzahn: [V: +1 C: +2] delete expired certs etcd.eqiad.wmnet.crt and etcd.codfw.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) (owner: Dzahn)
[00:05:38] <icinga-wm>	 PROBLEM - purged service on cp3060 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:11:34] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-clean-fairscheduler-event-logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:21] <wikibugs_>	 (CR) Dzahn: [V: +1 C: +2] "noop on conf* and elsewhere like kubetcd*" [puppet] - https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) (owner: Dzahn)
[00:22:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298560)', diff saved to https://phabricator.wikimedia.org/P28159 and previous config saved to /var/cache/conftool/dbconfig/20220520-002227-ladsgroup.json
[00:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:34] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[00:23:57] <wikibugs_>	 (CR) Dzahn: "I had similar ones back in 2017 but eventually gave up. https://gerrit.wikimedia.org/r/q/topic:expired-certs" [puppet] - https://gerrit.wikimedia.org/r/793673 (owner: Dzahn)
[00:27:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host netmon1003.wikimedia.org with OS bullseye
[00:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:03] <wikibugs_>	 SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host netmon1003.wikimedia.org with OS bullseye
[00:29:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon1003.wikimedia.org with reason: host reimage
[00:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:33:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon1003.wikimedia.org with reason: host reimage
[00:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28160 and previous config saved to /var/cache/conftool/dbconfig/20220520-003732-ladsgroup.json
[00:37:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:14] <icinga-wm>	 RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:40:39] <wikibugs_>	 (CR) Dzahn: "I am not very familiar with the partman syntax but the easiest way forward here is to just try it and reimage gitlab1004 which is still "i" [puppet] - https://gerrit.wikimedia.org/r/793534 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[00:44:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon1003.wikimedia.org with OS bullseye
[00:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:01] <wikibugs_>	 SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host netmon1003.wikimedia.org with OS bullseye completed: - netmon1003 (**WARN**)   - Downtimed...
[00:52:29] <wikibugs_>	 SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (Papaul)
[00:52:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28161 and previous config saved to /var/cache/conftool/dbconfig/20220520-005237-ladsgroup.json
[00:52:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:45] <wikibugs_>	 SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (Papaul) a:Papaul→Jclark-ctr @Jclark-ctr complete
[01:07:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298560)', diff saved to https://phabricator.wikimedia.org/P28162 and previous config saved to /var/cache/conftool/dbconfig/20220520-010743-ladsgroup.json
[01:07:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:48] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:09:18] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[01:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:19] <wikibugs_>	 (PS1) Cathal Mooney: Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989)
[01:23:56] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[01:26:03] <wikibugs_>	 (PS2) Cathal Mooney: Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989)
[01:27:37] <wikibugs_>	 SRE, ops-codfw, Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (ssingh)
[01:28:50] <wikibugs_>	 (CR) Cathal Mooney: [C: +2] Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[01:29:26] <wikibugs_>	 (Merged) jenkins-bot: Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[01:31:48] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:08] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:57] <wikibugs_>	 (CR) Dzahn: sre: port mx queue high page (1 comment) [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:10] <icinga-wm>	 PROBLEM - Disk space on gitlab1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1003&var-datasource=eqiad+prometheus/ops
[02:05:14] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[02:07:24] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:36:43] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:46:02] <wikibugs>	 (PS2) KartikMistry: Enable ContentTranslation as default for cs, el, he, ko and tr WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/793444 (https://phabricator.wikimedia.org/T298239)
[05:04:51] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:09:30] <marostegui>	 !log dbmaint s1@eqiad T298554
[05:09:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:37] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[05:12:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (Marostegui) p:Triage→Medium a:Papaul
[05:13:33] <wikibugs>	 (CR) Muehlenhoff: "Looks good, merging." [puppet] - https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:13:35] <wikibugs>	 (CR) Muehlenhoff: [C: +2] vagrant: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:14:47] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:18:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] striker: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793397 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:18:33] <wikibugs>	 (PS2) Muehlenhoff: striker: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793397 (https://phabricator.wikimedia.org/T308013)
[05:22:49] <wikibugs>	 (PS2) Muehlenhoff: dnsdist: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013)
[05:25:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] dnsdist: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:26:48] <wikibugs>	 (PS2) Muehlenhoff: gitlab/gitlab_runner: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013)
[05:31:02] <wikibugs>	 (PS3) Muehlenhoff: bird/fastnetmon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013)
[05:33:53] <wikibugs>	 (PS1) Marostegui: Revert "db2092: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/793587
[05:34:35] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2092: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/793587 (owner: Marostegui)
[05:35:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] bird/fastnetmon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:39:41] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:03:04] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Dzahn) >>! In T306654#7942789, @jbond wrote: >  the work flow that requires puppet-merge  @Jclark-ctr  Correct me if I'm wrong but this is mostly abou...
[06:03:19] <moritzm>	 !log racadm racreset on ganeti5003
[06:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:37] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:15:09] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH @RobH I'm unable to reimage ganeti5003, the ipmitool call fails with "Error: Unable to establish IPMI v2 / RMCP+ session"  I've tried a racres...
[06:15:51] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: an-tool1005, cloudservices1003, cloudservices1004, ms-be1068, ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, releases1002, releases2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wik
[06:15:51] <icinga-wm>	 rg/wiki/Puppet%23check_puppet_run_changes
[06:18:21] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (Marostegui) Open→Resolved Closing this for now. Reopen if needed. Thanks for reporting!
[06:19:21] <wikibugs>	 (PS1) Muehlenhoff: Switch idp-test2002 to idp_test role [puppet] - https://gerrit.wikimedia.org/r/793634 (https://phabricator.wikimedia.org/T308214)
[06:19:37] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (Marostegui) @MoritzMuehlenhoff good to close?
[06:19:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable component/ganeti3 for the esams cluster [puppet] - https://gerrit.wikimedia.org/r/793491 (https://phabricator.wikimedia.org/T308238) (owner: Muehlenhoff)
[06:20:13] <wikibugs>	 (PS2) Muehlenhoff: Switch idp-test2002 to idp_test role [puppet] - https://gerrit.wikimedia.org/r/793634 (https://phabricator.wikimedia.org/T308214)
[06:34:35] <wikibugs>	 (PS1) Marostegui: db1118: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/793635
[06:36:38] <wikibugs>	 (CR) Marostegui: [C: +2] db1118: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/793635 (owner: Marostegui)
[06:36:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P28164 and previous config saved to /var/cache/conftool/dbconfig/20220520-063656-root.json
[06:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:49] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:33] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:46:10] <wikibugs>	 (CR) Slyngshede: [C: +2] WIP: Trial implementation of a private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: Slyngshede)
[06:48:52] <wikibugs>	 SRE, SRE-Access-Requests, Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (elukey) @thcipriani Hi! When you have a moment, could you please review this request and let me know if it is a goo...
[06:52:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P28166 and previous config saved to /var/cache/conftool/dbconfig/20220520-065200-root.json
[06:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "Technically LGTM, however this might have some repercussions. There is some minor puppet stuff (mostly mtail tests, so nothing breaking) t" [debs/pybal] - https://gerrit.wikimedia.org/r/743222 (owner: Ebernhardson)
[06:55:55] <wikibugs>	 (CR) Giuseppe Lavagetto: mediawiki::system_users: add mwpresync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: Giuseppe Lavagetto)
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220520T0700)
[07:01:25] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:03:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:06:26] <wikibugs>	 (PS1) Slyngshede: WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704
[07:07:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P28167 and previous config saved to /var/cache/conftool/dbconfig/20220520-070704-root.json
[07:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:01] <wikibugs>	 (PS2) Slyngshede: WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704
[07:09:49] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:08] <wikibugs>	 (CR) Muehlenhoff: WIP: Private APT repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:10:53] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:20] <wikibugs>	 (PS3) Slyngshede: WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704
[07:11:54] <wikibugs>	 (CR) Slyngshede: WIP: Private APT repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:14:57] <wikibugs>	 (Abandoned) Jcrespo: Revert "dumps: Block python requests UA" [puppet] - https://gerrit.wikimedia.org/r/784715 (owner: Jcrespo)
[07:15:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:15:50] <wikibugs>	 (CR) Slyngshede: [C: +2] WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:18:41] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks great, ship it" [debs/kubeconform] - https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[07:22:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P28168 and previous config saved to /var/cache/conftool/dbconfig/20220520-072208-root.json
[07:22:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:50] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff) That is still work in progress
[07:25:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch idp-test2002 to idp_test role [puppet] - https://gerrit.wikimedia.org/r/793634 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[07:33:17] <wikibugs>	 (CR) JMeybohm: [C: +2] Add debian directory [debs/kubeconform] - https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[07:35:11] <wikibugs>	 (PS1) Elukey: profile::cassandra::single_instance: add target_version [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232)
[07:35:32] <wikibugs>	 (PS1) Muehlenhoff: profile::idp::build: Add missing package required for CAS build [puppet] - https://gerrit.wikimedia.org/r/793708
[07:35:54] <wikibugs>	 (Merged) jenkins-bot: Add debian directory [debs/kubeconform] - https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[07:36:26] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35429/console" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[07:36:45] <wikibugs>	 (PS2) Muehlenhoff: profile::idp::build: Add missing package required for CAS build [puppet] - https://gerrit.wikimedia.org/r/793708
[07:37:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P28169 and previous config saved to /var/cache/conftool/dbconfig/20220520-073712-root.json
[07:37:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:50] <wikibugs>	 (CR) Elukey: [V: +1] "The profile::cassandra::single_instance seems used only in deployment-prep afaics, so maybe it is not really used at all. Asking around be" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[07:40:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] profile::idp::build: Add missing package required for CAS build [puppet] - https://gerrit.wikimedia.org/r/793708 (owner: Muehlenhoff)
[07:48:02] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff)
[07:50:31] <wikibugs>	 (PS1) Slyngshede: WIP: Private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793710
[07:51:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: Private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793710 (owner: Slyngshede)
[07:52:03] <jayme>	 !log imported kubeconform 0.4.13-1 to buster-,bullseye-wikimedia - T306165
[07:52:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:09] <stashbot>	 T306165: Replace kubeyaml in deployment-charts CI - https://phabricator.wikimedia.org/T306165
[07:52:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P28170 and previous config saved to /var/cache/conftool/dbconfig/20220520-075215-root.json
[07:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:03] <wikibugs>	 (PS2) Slyngshede: WIP: Private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793710
[07:53:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2062.codfw.wmnet with OS bullseye
[07:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:15] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2062.codfw.wmnet with OS bullseye
[07:54:04] <wikibugs>	 (PS2) Elukey: profile::cassandra::single_instance: add target_version and rack [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232)
[07:54:06] <wikibugs>	 (PS1) Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - https://gerrit.wikimedia.org/r/793711
[07:55:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP]Move debmonitor to the new django profile format [puppet] - https://gerrit.wikimedia.org/r/793711 (owner: Jcrespo)
[07:55:18] <wikibugs>	 (PS3) Elukey: profile::cassandra::single_instance: add target_version and rack [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232)
[07:56:18] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35430/console" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:06:58] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on idp-test2002 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:07:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P28171 and previous config saved to /var/cache/conftool/dbconfig/20220520-080719-root.json
[08:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2062.codfw.wmnet with reason: host reimage
[08:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:22] <icinga-wm>	 PROBLEM - Check that envoy is running on idp-test2002 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:12:49] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2062.codfw.wmnet with reason: host reimage
[08:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:18] <wikibugs>	 (PS1) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[08:21:28] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35431/console" [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:22:32] <wikibugs>	 (PS2) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[08:38:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:42:28] <wikibugs>	 (CR) Filippo Giunchedi: "Thanks for the feedback folks! I'll merge early next week" [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:43:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! (untested)" [puppet] - https://gerrit.wikimedia.org/r/793094 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[08:44:45] <wikibugs>	 (CR) JMeybohm: [C: +1] "recheck (for https://gerrit.wikimedia.org/r/c/integration/config/+/793506)" [deployment-charts] - https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[08:44:48] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2062.codfw.wmnet with OS bullseye
[08:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:52] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2062.codfw.wmnet with OS bullseye completed: - ms-be2062 (**PASS**)   - Downtim...
[08:45:05] <wikibugs>	 (CR) Jcrespo: alerting_host: Remove references to dbbackups monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[08:45:37] <wikibugs>	 (CR) JMeybohm: "recheck (expected to fail, missing local schema)" [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[08:46:00] <icinga-wm>	 PROBLEM - Memcached on idp-test2002 is CRITICAL: connect to address 208.80.153.70 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[08:46:44] <wikibugs>	 (CR) Jcrespo: [C: +2] alert_host: Ensure packages and files from dbbackups check are gone [puppet] - https://gerrit.wikimedia.org/r/793094 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[08:46:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] Replace kubeyaml with kubeconform (if available) [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[08:48:17] <wikibugs>	 (PS1) Elukey: Add fake secret for the new ML Cassandra cluster [labs/private] - https://gerrit.wikimedia.org/r/793717
[08:48:32] <wikibugs>	 (CR) Btullis: "I've now generated the keys in the private repo, so this should be unblocked." [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[08:48:40] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] Add fake secret for the new ML Cassandra cluster [labs/private] - https://gerrit.wikimedia.org/r/793717 (owner: Elukey)
[08:51:45] <wikibugs>	 (PS3) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[08:51:48] <wikibugs>	 (CR) Btullis: [C: +1] "This looks good to me. Should I +2 and deploy now?" [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[08:51:52] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:52:05] <wikibugs>	 (PS2) Jcrespo: ddbackups: Remove old references to the check pass on the alert hosts [labs/private] - https://gerrit.wikimedia.org/r/793042 (https://phabricator.wikimedia.org/T283017)
[08:52:16] <wikibugs>	 (PS2) Btullis: Enable cassandra encryption (aqs cluster) [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[08:52:38] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35434/console" [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:53:58] <vgutierrez>	 !log re-enabling puppet and repooling cp3060 - T308797 T243167
[08:54:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:05] <stashbot>	 T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167
[08:54:06] <stashbot>	 T308797: cp3060 idrac https interface failures - https://phabricator.wikimedia.org/T308797
[08:55:58] <icinga-wm>	 RECOVERY - purged service on cp3060 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:57:07] <wikibugs>	 10SRE, 10ops-esams, 10DC-Ops: cp3060 idrac https interface failures - https://phabricator.wikimedia.org/T308797 (10Vgutierrez) @RobH feel free to depool / disable-puppet again on cp3060 when you need to work on it, meanwhile I'm letting cp3060 handle some traffic in esams :)
[09:00:11] <wikibugs>	 (03CR) 10Jcrespo: [V: 03+2 C: 03+2] ddbackups: Remove old references to the check pass on the alert hosts [labs/private] - 10https://gerrit.wikimedia.org/r/793042 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[09:07:20] <wikibugs>	 (03PS18) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[09:11:43] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847)
[09:13:03] <wikibugs>	 (03PS1) 10Volans: wikimedia-dns: add zone validator ignore comments [dns] - 10https://gerrit.wikimedia.org/r/793724 (https://phabricator.wikimedia.org/T155761)
[09:13:05] <wikibugs>	 (03PS1) 10Volans: fr-tech: fix typo in PTR record [dns] - 10https://gerrit.wikimedia.org/r/793725 (https://phabricator.wikimedia.org/T308672)
[09:13:07] <wikibugs>	 (03PS1) 10Volans: fr-tech: add zone validator ignore comments [dns] - 10https://gerrit.wikimedia.org/r/793726 (https://phabricator.wikimedia.org/T308672)
[09:13:09] <wikibugs>	 (03PS1) 10Volans: Non-WMF IPs: add zone validator ignore comments [dns] - 10https://gerrit.wikimedia.org/r/793727 (https://phabricator.wikimedia.org/T155761)
[09:13:11] <wikibugs>	 (03PS1) 10Volans: Duplicate IPs by design: add zone validator ignore [dns] - 10https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761)
[09:13:13] <wikibugs>	 (03PS1) 10Volans: wikitech-static-iad: remove obsolete records [dns] - 10https://gerrit.wikimedia.org/r/793729 (https://phabricator.wikimedia.org/T155761)
[09:14:03] <wikibugs>	 (03PS4) 10Filippo Giunchedi: mediawiki: remove idle php-fpm workers alert, moved to prometheus/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/791360 (https://phabricator.wikimedia.org/T305847)
[09:14:05] <wikibugs>	 (03PS4) 10Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847)
[09:14:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847)
[09:14:09] <wikibugs>	 (03CR) 10Jcrespo: "Not yet happy with the hiera parameters, so this is not deploy-ready, plus not sure if some of the change are bug fixes or intentional, bu" [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[09:15:06] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: puppet admin: cxheck if additional gropus in systenmd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (10jbond) p:05Triage→03Medium
[09:16:20] <wikibugs>	 (03PS2) 10Volans: Duplicate names by design: add zone validator ignore [dns] - 10https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761)
[09:16:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] mediawiki::system_users: add mwpresync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto)
[09:17:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2063.codfw.wmnet with OS bullseye
[09:17:48] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2063.codfw.wmnet with OS bullseye
[09:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:56] <wikibugs>	 (03CR) 10Jbond: "CR LGTM, is there a phab task to link to?" [puppet] - 10https://gerrit.wikimedia.org/r/793550 (owner: 10JHathaway)
[09:19:53] <wikibugs>	 (03PS5) 10Jcrespo: alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017)
[09:24:38] <wikibugs>	 (03CR) 10Volans: "I've also quickly checked on the rackspace portal and didn't find any reference to wikitech-static-iad" [dns] - 10https://gerrit.wikimedia.org/r/793729 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans)
[09:33:45] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2063.codfw.wmnet with reason: host reimage
[09:34:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:19] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2063.codfw.wmnet with reason: host reimage
[09:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:19] <jinxer-wm>	 (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:35:19] <jinxer-wm>	 (ProbeDown) firing: (21) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:35:34] * volans here
[09:35:41] * jelto here
[09:35:45] * volans acked on VO
[09:35:55] <godog>	 here too
[09:36:07] * jbond here
[09:36:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:36:13] <Amir1>	 On phone
[09:36:19] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.002481 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[09:36:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:36:23] <_joe_>	 it's the database
[09:36:24] <icinga-wm>	 PROBLEM - Apache HTTP on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:36:40] <twentyafterfour>	 wow it wasn't just me ... 
[09:36:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:36:40] <Amir1>	 Running back
[09:36:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:36:43] <_joe_>	 volans: take a look at the slow queries is my suggestion
[09:36:46] <icinga-wm>	 PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:36:48] <volans>	 yeah I'm looking
[09:36:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[09:36:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:12] <icinga-wm>	 PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:18] <icinga-wm>	 PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:21] <volans>	 db1172 comes up quite often
[09:37:24] <_joe_>	 marostegui: we might need help
[09:37:30] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[09:37:33] <icinga-wm>	 PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1636 bytes in 1.474 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:37:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:39] <icinga-wm>	 PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:37:40] <icinga-wm>	 PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:37:42] <icinga-wm>	 PROBLEM - Apache HTTP on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:46] <icinga-wm>	 PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:49] <icinga-wm>	 PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:37:53] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:37:53] <icinga-wm>	 PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:37:53] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1326.eqiad.wmnet, mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1371.eqiad.wmnet, mw1455.eqiad.wmnet, mw1453.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1322.eqiad.wmnet, mw1432.eqiad.wmnet, mw1323.eqiad.wmnet, mw1384.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1407.eqiad.wmnet, mw
[09:37:53] <icinga-wm>	 ad.wmnet, mw1351.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1352.eqiad.wmnet, mw1399.eqiad.wmnet, mw1368.eqiad.wmnet, mw1435.eqiad.wmnet, mw1454.eqiad.wmnet, mw1431.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1411.eqiad.wmnet, mw1354.eqiad.wmnet, mw1366.eqiad.wmnet, mw1324.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1397.eqiad.wmnet, mw1319.eqiad.wmnet, mw1389.eqiad.wmnet, mw1418.eqiad
[09:37:53] <icinga-wm>	 mw1321.eqiad.wmnet, mw1395.eqiad.wmnet, mw1403.eqiad.wmnet, mw1325.eqiad.wmnet, mw1409.eqiad.wmnet, mw1385.eqiad.wmnet, mw1436.eqiad.wmnet, mw1417.eqiad.wmnet, mw1367.eqiad.wmnet, mw144 https://wikitech.wikimedia.org/wiki/PyBal
[09:38:06] <marostegui>	 what's up?
[09:38:07] <Amir1>	 Marostegui ^ jynus 
[09:38:11] <marostegui>	 is it db1172?
[09:38:12] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9722 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:38:13] <volans>	 db1172 could you have a look?
[09:38:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:18] <icinga-wm>	 PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1433.eqiad.wmnet, mw1365.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1432.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, mw1430.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw
[09:38:18] <icinga-wm>	 ad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1403.eqiad.wmnet, mw1389.eqiad.wmnet, mw1395.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1385.eqiad.wmnet, mw1417.eqiad.wmnet, mw1455.eqiad.wmnet, mw1373.eqiad.wmnet, mw1326.eqiad.wmnet, mw1332.eqiad.wmnet, mw1452.eqiad.wmnet, mw1367.eqiad.wmnet, mw1414.eqiad.wmnet, mw1369.eqiad.wmnet, mw1371.eqiad.wmnet, mw1453.eqiad.wmnet, mw1322.eqiad
[09:38:18] <icinga-wm>	 mw1319.eqiad.wmnet, mw1323.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1351.eqiad.wmnet, mw1391.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw141 https://wikitech.wikimedia.org/wiki/PyBal
[09:38:18] <marostegui>	 sure 
[09:38:21] <icinga-wm>	 PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1625 bytes in 0.706 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:38:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:21] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[09:38:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:22] <icinga-wm>	 PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:24] <icinga-wm>	 PROBLEM - Apache HTTP on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:29] <icinga-wm>	 PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:38:33] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:38:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:38:36] <icinga-wm>	 PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:38:37] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:38:40] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[09:38:46] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[09:38:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1419 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.743 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:39:03] <icinga-wm>	 PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:39:04] <icinga-wm>	 PROBLEM - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:39:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:39:06] <icinga-wm>	 PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:39:07] <icinga-wm>	 PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:39:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 167 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:39:28] <logmsgbot>	 !log volans@cumin1001 dbctl commit (dc=all): 'emergency depool', diff saved to https://phabricator.wikimedia.org/P28172 and previous config saved to /var/cache/conftool/dbconfig/20220520-093928-volans.json
[09:39:34] <marostegui>	 I am going to depool it for now
[09:39:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.965 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:39:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:39:40] <volans>	 marostegui: already done
[09:39:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:39:59] <stashbot>	 volans@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[09:40:05] <icinga-wm>	 RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18861 bytes in 7.722 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:40:06] <icinga-wm>	 RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18885 bytes in 7.260 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:40:08] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.375 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:40:09] <apergos>	 coordination is happening in the other channel
[09:40:16] <apergos>	 too much noise here
[09:40:17] <icinga-wm>	 RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18860 bytes in 8.256 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:40:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:18] <icinga-wm>	 RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.650 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:40:20] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:24] <_joe_>	 ok and we're magically back
[09:40:30] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:36] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:38] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:40] <_joe_>	 uhm no
[09:40:44] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:44] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:40:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.438 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:40:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.489 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:40:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.183 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:40:48] <icinga-wm>	 RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:40:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.921 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:40:56] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:41:01] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18861 bytes in 9.082 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:41:01] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:41:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:41:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:41:13] <_joe_>	 I'm not sure we're out of the woods
[09:41:20] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:41:29] <icinga-wm>	 RECOVERY - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 17748 bytes in 7.332 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:41:30] <icinga-wm>	 RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.413 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:41:31] <icinga-wm>	 RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18861 bytes in 5.797 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:41:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.439 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:41:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.901 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:41:46] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:41:48] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:41:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.669 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:42:01] <jbond>	 looks like it moved to db1131
[09:42:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:42:06] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:42:20] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:42:26] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:29] <icinga-wm>	 RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 0.840 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:42:45] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 4.661 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:43:16] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:43:21] <icinga-wm>	 RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 5.734 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:43:28] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:43:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.795 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:44:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 48 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:44:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.152 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:45:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.946 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:45:18] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:45:19] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:45:28] <icinga-wm>	 RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 2.377 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:45:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:45:52] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 7.729 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:02] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmne
[09:46:02] <icinga-wm>	 8.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:46:12] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 6.490 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:14] <icinga-wm>	 RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.355 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.391 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:16] <icinga-wm>	 RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 1.564 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.507 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.731 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:26] <icinga-wm>	 RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:28] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:46:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5008.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5008.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:46:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:34] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:48] <icinga-wm>	 RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:48] <icinga-wm>	 RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:52] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:56] <icinga-wm>	 RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:46:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:46:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:47:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:47:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:47:16] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:47:20] <icinga-wm>	 RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:47:20] <icinga-wm>	 RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:47:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:47:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:47:26] <icinga-wm>	 RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:47:28] <icinga-wm>	 RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:47:30] <icinga-wm>	 RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:47:34] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:47:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:47:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:47:46] <icinga-wm>	 RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 3.661 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:47:48] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Bawolff)
[09:47:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:47:55] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.634 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[09:47:58] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:48:09] <icinga-wm>	 RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 2.065 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:48:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:48:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:48:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:48:52] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5003 is CRITICAL: 5.145e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003
[09:49:04] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[09:49:04] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 3.21e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[09:49:32] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 3.301e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015
[09:49:40] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:50:02] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[09:50:18] <jinxer-wm>	 (ProbeDown) resolved: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:50:18] <jinxer-wm>	 (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:50:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:51:00] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5003 is OK: (C)5000 gt (W)3000 gt 649.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003
[09:51:14] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 197.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[09:51:38] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 223.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015
[10:01:10] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:01:34] <wikibugs>	 SRE, Wikimedia-Incident: Very long loading and crash with an error across all Wikimedia sites - https://phabricator.wikimedia.org/T308827 (RhinosF1) p:Unbreak!→Lowest Dupe of mentioned task
[10:02:07] <wikibugs>	 SRE, Wikimedia-Incident: Very long loading and crash with an error across all Wikimedia sites - https://phabricator.wikimedia.org/T308827 (RhinosF1)
[10:04:42] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2063.codfw.wmnet with OS bullseye
[10:04:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:46] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2063.codfw.wmnet with OS bullseye completed: - ms-be2063 (**WARN**)   - Downtim...
[10:06:18] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:15] <wikibugs>	 (PS1) Ladsgroup: Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737
[10:08:44] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:08:53] <wikibugs>	 (PS2) Ladsgroup: Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737
[10:14:25] <wikibugs>	 (CR) Ladsgroup: [C: +2] Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737 (owner: Ladsgroup)
[10:15:12] <wikibugs>	 (Merged) jenkins-bot: Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737 (owner: Ladsgroup)
[10:17:04] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793737|Revert read new on frwiki for templatelinks migration]] (duration: 00m 51s)
[10:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:18:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:19:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:20:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:46] <wikibugs>	 (PS1) Jbond: CONTRIBUTORS: add Brian Wolff [puppet] - https://gerrit.wikimedia.org/r/793740 (https://phabricator.wikimedia.org/T308013)
[10:26:28] <wikibugs>	 (CR) Jbond: [C: +2] CONTRIBUTORS: add Brian Wolff [puppet] - https://gerrit.wikimedia.org/r/793740 (https://phabricator.wikimedia.org/T308013) (owner: Jbond)
[10:34:20] <wikibugs>	 (CR) Hnowlan: [C: +2] Add helmfile configuration for image-suggestion [deployment-charts] - https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[10:36:57] * TheresNoTime looks at scrollback
[10:37:10] <TheresNoTime>	 and a *good morning* to you too icinga-wm o.O
[10:39:50] <wikibugs>	 (Merged) jenkins-bot: Add helmfile configuration for image-suggestion [deployment-charts] - https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[10:42:15] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:42:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: sync
[10:42:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:38] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "For code aimed for production using systemd::sysuser is nowadays the preferred way, but since this is for some initial limited to Cloud VP" [puppet] - https://gerrit.wikimedia.org/r/793710 (owner: Slyngshede)
[10:51:27] <wikibugs>	 (CR) Giuseppe Lavagetto: "I would bump this up to two minutes, otherwise lgtm." [alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[10:52:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync
[10:52:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:03] <moritzm>	 !log uploaded cas 6.4.6.3-wmf11u1 to apt.wikimedia.org/bullseye
[10:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:11] <wikibugs>	 (PS3) Filippo Giunchedi: sre: port mediawiki php-fpm saturation alert [alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847)
[10:56:13] <wikibugs>	 (PS2) Filippo Giunchedi: sre: port mx queue high page [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847)
[10:56:15] <wikibugs>	 (PS2) Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847)
[10:56:17] <wikibugs>	 (CR) Filippo Giunchedi: sre: port mediawiki php-fpm saturation alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[10:58:19] <wikibugs>	 (PS1) Muehlenhoff: Only add component/memcached16 on Buster [puppet] - https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214)
[10:59:47] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[11:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:09:39] <jynus>	 !log drop backupcheck users from m1>dbbackups
[11:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2104.codfw.wmnet with reason: Maintenance
[11:10:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2104.codfw.wmnet with reason: Maintenance
[11:10:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 8:00:00 on 8 hosts with reason: Maintenance
[11:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 8:00:00 on 8 hosts with reason: Maintenance
[11:10:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[11:12:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[11:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298555)', diff saved to https://phabricator.wikimedia.org/P28173 and previous config saved to /var/cache/conftool/dbconfig/20220520-111239-ladsgroup.json
[11:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:47] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[11:14:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[11:15:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[11:15:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:05] <wikibugs>	 (PS1) Hnowlan: image-suggestion: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891)
[11:18:47] <wikibugs>	 (PS1) Jbond: rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013)
[11:20:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond)
[11:20:30] <wikibugs>	 (PS2) Jbond: rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013)
[11:20:51] <wikibugs>	 (CR) Jbond: [C: +2] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond)
[11:21:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond)
[11:21:52] <wikibugs>	 (CR) Slyngshede: [C: +2] WIP: Private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793710 (owner: Slyngshede)
[11:22:28] <wikibugs>	 (PS3) Jbond: rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013)
[11:22:43] <wikibugs>	 (CR) Jbond: [C: +2] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond)
[11:24:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[11:24:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[11:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T303603)', diff saved to https://phabricator.wikimedia.org/P28174 and previous config saved to /var/cache/conftool/dbconfig/20220520-112449-ladsgroup.json
[11:24:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:56] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[11:29:16] <wikibugs>	 (PS1) Muehlenhoff: Also add component/idp-test for Bullseye [puppet] - https://gerrit.wikimedia.org/r/793751 (https://phabricator.wikimedia.org/T308214)
[11:30:07] <jbond>	 slyngs: happy for me to merge your CR
[11:30:27] <slyngs>	 Yes, it's just for testing
[11:31:05] <slyngs>	 But I was wondering where it went :-)
[11:31:06] <jbond>	 ack, one sec, I'm going to quickly send another one with these
[11:31:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Also add component/idp-test for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793751 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[11:31:59] <moritzm>	 jbond: you can also merge along my change, then
[11:32:07] <jbond>	 ack will do thanks
[11:32:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T303603)', diff saved to https://phabricator.wikimedia.org/P28175 and previous config saved to /var/cache/conftool/dbconfig/20220520-113207-ladsgroup.json
[11:32:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:32:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:13] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[11:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:33] <wikibugs>	 (03PS1) 10Jbond: rake: sremove debug statments [puppet] - 10https://gerrit.wikimedia.org/r/793753
[11:32:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] rake: sremove debug statments [puppet] - 10https://gerrit.wikimedia.org/r/793753 (owner: 10Jbond)
[11:33:21] <jbond>	 slyngs: moritzm: merge
[11:34:08] <jbond>	 d
[11:36:21] <wikibugs>	 (03PS3) 10JMeybohm: Remove null creationTimestamp from CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165)
[11:36:23] <wikibugs>	 (03PS2) 10JMeybohm: Add crds.yaml fixtures to charts and istio schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/793509 (https://phabricator.wikimedia.org/T306165)
[11:36:25] <wikibugs>	 (03PS8) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165)
[11:37:09] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: puppet admin: check if additional gropus in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (10jbond)
[11:38:55] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: puppet admin: check if additional groups in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (10jbond)
[11:40:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[11:41:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[11:41:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[11:41:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T303603)', diff saved to https://phabricator.wikimedia.org/P28176 and previous config saved to /var/cache/conftool/dbconfig/20220520-114202-ladsgroup.json
[11:42:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:11] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[11:42:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298555)', diff saved to https://phabricator.wikimedia.org/P28177 and previous config saved to /var/cache/conftool/dbconfig/20220520-114234-ladsgroup.json
[11:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:39] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[11:42:43] <wikibugs>	 (03PS2) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[11:43:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] mediawiki::system_users: add mwpresync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto)
[11:43:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 (owner: 10Jcrespo)
[11:47:07] <wikibugs>	 (03PS1) 10Jbond: P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - 10https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826)
[11:47:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - 10https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) (owner: 10Jbond)
[11:48:32] <wikibugs>	 (03PS2) 10Jbond: P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - 10https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826)
[11:48:49] <wikibugs>	 (03PS2) 10Hnowlan: image-suggestion: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891)
[11:49:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35437/console" [puppet] - 10https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) (owner: 10Jbond)
[11:54:14] <wikibugs>	 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Daimona) >>! In T243847#7689119, @Daimona wrote: > Sorry for...
[11:54:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2064.codfw.wmnet with OS bullseye
[11:54:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:46] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2064.codfw.wmnet with OS bullseye
[11:57:24] <wikibugs>	 (03PS1) 10Ladsgroup: Turn on WRITE BOTH for templatelink migration in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793763 (https://phabricator.wikimedia.org/T299421)
[12:00:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - 10https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) (owner: 10Jbond)
[12:03:06] <wikibugs>	 (03PS1) 10Roman Stolar: [Do Not Merge] Improvements back upstream to have stable version with "thumbor community core" dependency. [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/793765 (https://phabricator.wikimedia.org/T308561)
[12:04:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [Do Not Merge] Improvements back upstream to have stable version with "thumbor community core" dependency. [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/793765 (https://phabricator.wikimedia.org/T308561) (owner: 10Roman Stolar)
[12:05:18] <wikibugs>	 (03PS1) 10Jbond: Revert "P:sretest: test that sysuser can add users to groups managed by admin.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/793602
[12:06:49] <wikibugs>	 (03PS19) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[12:06:51] <wikibugs>	 (03PS3) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[12:07:08] <wikibugs>	 (03PS2) 10Roman Stolar: [Do Not Merge] Improvements back upstream to have stable version with "thumbor community core" dependency. [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/793765 (https://phabricator.wikimedia.org/T308561)
[12:07:13] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: puppet admin: check if additional groups in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (10jbond) confirmed that the addtional_gropups parameter is not compatible with groups managed by the admin module.  t...
[12:07:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "P:sretest: test that sysuser can add users to groups managed by admin.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/793602 (owner: 10Jbond)
[12:07:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:sretest: test that sysuser can add users to groups managed by admin.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/793602 (owner: 10Jbond)
[12:10:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[12:10:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2064.codfw.wmnet with reason: host reimage
[12:10:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[12:11:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[12:11:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] mediawiki::system_users: add mwpresync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto)
[12:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28178 and previous config saved to /var/cache/conftool/dbconfig/20220520-121116-ladsgroup.json
[12:11:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:21] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[12:12:05] <wikibugs>	 (03PS4) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947)
[12:13:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[12:13:49] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2064.codfw.wmnet with reason: host reimage
[12:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:17] <wikibugs>	 (03PS1) 10Stang: commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407)
[12:21:45] <wikibugs>	 (03PS1) 10Jbond: CONTRIBUTORS: Add my own personal email address [puppet] - 10https://gerrit.wikimedia.org/r/793767
[12:23:27] <Amir1>	 !log killed refreshlinks suggestion in 10160
[12:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:31] <Amir1>	 in hiwiki
[12:26:40] <wikibugs>	 (03PS4) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[12:26:44] <wikibugs>	 (03PS1) 10Slyngshede: WIP: Private APT repo. [puppet] - 10https://gerrit.wikimedia.org/r/793769
[12:29:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Amire80)
[12:30:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T303603)', diff saved to https://phabricator.wikimedia.org/P28179 and previous config saved to /var/cache/conftool/dbconfig/20220520-123037-ladsgroup.json
[12:30:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:30:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:43] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[12:30:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T303603)', diff saved to https://phabricator.wikimedia.org/P28180 and previous config saved to /var/cache/conftool/dbconfig/20220520-123045-ladsgroup.json
[12:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793769 (owner: 10Slyngshede)
[12:31:46] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] WIP: Private APT repo. [puppet] - 10https://gerrit.wikimedia.org/r/793769 (owner: 10Slyngshede)
[12:37:32] <moritzm>	 !log copy prometheus-mcrouter-exporter from buster-wikimedia to bullseye-wikimedia (needed for T308214)
[12:37:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:37] <stashbot>	 T308214: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214
[12:42:45] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@51a203f]: (no justification provided)
[12:42:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:53] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@51a203f]: (no justification provided) (duration: 00m 07s)
[12:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:26] <wikibugs>	 (03PS2) 10TheDJ: Remove unused OggThumbLocation config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791605 (https://phabricator.wikimedia.org/T308191)
[12:44:34] <wikibugs>	 (03PS5) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[12:45:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable new Bullseye test IDPs in acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/793770 (https://phabricator.wikimedia.org/T308214)
[12:45:08] <wikibugs>	 (03CR) 10ArielGlenn: "Please excuse my drive by comment. But... labstore1006 isn't always the box handling web service. Maybe it makes sense to use a service na" [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: 10BBlack)
[12:47:09] <wikibugs>	 (03PS2) 10Jbond: CONTRIBUTORS: Add myself and Amir E. Aharoni to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/793767 (https://phabricator.wikimedia.org/T308013)
[12:49:15] <wikibugs>	 (03PS1) 10Jcrespo: django: Add dummy django secret key and mysql pass to test compilation [labs/private] - 10https://gerrit.wikimedia.org/r/793771 (https://phabricator.wikimedia.org/T283017)
[12:50:03] <wikibugs>	 (03CR) 10Jcrespo: [V: 03+2 C: 03+2] django: Add dummy django secret key and mysql pass to test compilation [labs/private] - 10https://gerrit.wikimedia.org/r/793771 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[12:51:25] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "I'll deploy it on Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791605 (https://phabricator.wikimedia.org/T308191) (owner: 10TheDJ)
[12:52:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10zeljkofilipin)
[12:54:36] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2064.codfw.wmnet with OS bullseye
[12:54:39] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2064.codfw.wmnet with OS bullseye completed: - ms-be2064 (**PASS**)   - Downtim...
[12:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:54] <wikibugs>	 (03PS3) 10Jbond: CONTRIBUTORS: Add myself,Željko Filipin and Amir E. Aharoni to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/793767 (https://phabricator.wikimedia.org/T308013)
[12:55:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10santhosh)
[12:55:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: Add myself,Željko Filipin and Amir E. Aharoni to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/793767 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond)
[12:56:33] <wikibugs>	 (03PS20) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[12:57:11] <wikibugs>	 (03PS6) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[12:59:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[12:59:23] <wikibugs>	 (03PS1) 10Jbond: CONTRIBUTORS: Add Santhosh Thottingal [puppet] - 10https://gerrit.wikimedia.org/r/793772 (https://phabricator.wikimedia.org/T308013)
[13:00:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: Add Santhosh Thottingal [puppet] - 10https://gerrit.wikimedia.org/r/793772 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond)
[13:01:38] <wikibugs>	 (03PS1) 10Muehlenhoff: klaxon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013)
[13:01:40] <wikibugs>	 (03PS1) 10Muehlenhoff: helm/helmfile: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013)
[13:01:42] <wikibugs>	 (03PS1) 10Muehlenhoff: thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013)
[13:01:46] <wikibugs>	 (03PS1) 10Muehlenhoff: thumbor: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793776 (https://phabricator.wikimedia.org/T308013)
[13:01:48] <wikibugs>	 (03PS1) 10Muehlenhoff: amd_rocm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793777
[13:01:50] <wikibugs>	 (03PS1) 10Muehlenhoff: debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778
[13:04:02] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:05:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff)
[13:08:14] <wikibugs>	 (03PS21) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:08:54] <wikibugs>	 (03PS1) 10Jbond: rake spdx: add task to list of missing permission contributors [puppet] - 10https://gerrit.wikimedia.org/r/793780 (https://phabricator.wikimedia.org/T308013)
[13:09:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] rake spdx: add task to list of missing permission contributors [puppet] - 10https://gerrit.wikimedia.org/r/793780 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond)
[13:10:02] <wikibugs>	 (03PS7) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:10:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[13:12:00] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:12:25] <wikibugs>	 (03PS1) 10Jbond: rake spdx: make list unique [puppet] - 10https://gerrit.wikimedia.org/r/793781
[13:12:41] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] rake spdx: make list unique [puppet] - 10https://gerrit.wikimedia.org/r/793781 (owner: 10Jbond)
[13:15:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2065.codfw.wmnet with OS bullseye
[13:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:39] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2065.codfw.wmnet with OS bullseye
[13:17:43] <wikibugs>	 (03PS22) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:17:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 (owner: 10Jcrespo)
[13:18:15] <wikibugs>	 (03PS8) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:18:42] <wikibugs>	 (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/793711 (owner: 10Jcrespo)
[13:20:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[13:23:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T303603)', diff saved to https://phabricator.wikimedia.org/P28181 and previous config saved to /var/cache/conftool/dbconfig/20220520-132307-ladsgroup.json
[13:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:23:12] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[13:23:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:23:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance
[13:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance
[13:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:56] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet,service=ats-be
[13:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:01] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet,service=varnish-fe
[13:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet,service=ats-tls
[13:24:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459
[13:24:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459
[13:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:58] <stashbot>	 T308459: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459
[13:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Thanks very much for the patch, LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/793724 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans)
[13:31:43] <wikibugs>	 (03PS9) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:33:48] <wikibugs>	 (03PS10) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:35:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Jclark-ctr) @Marostegui  sorry i thought Chris had worked on it last week it was physically unplugged and crash cart was in rack.  i have plugged power back into it and it is up
[13:36:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Jclark-ctr) 05Open→03Resolved
[13:36:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Marostegui) Thanks John: ` ------------------------------------------------------------------------------- Record:      26 Date/Time:   05/20/2022 13:33:30 Source:      system Severity:    Ok Descrip...
[13:36:54] <wikibugs>	 (03PS23) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:36:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] klaxon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:37:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] helm/helmfile: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:37:07] <wikibugs>	 (03PS11) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:37:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:37:12] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/793787
[13:37:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] thumbor: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793776 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:37:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] amd_rocm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793777 (owner: 10Muehlenhoff)
[13:37:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[13:37:47] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Lovely to see this happening!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: 10Stang)
[13:38:14] <wikibugs>	 (03PS2) 10Jbond: debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff)
[13:38:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 1%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28182 and previous config saved to /var/cache/conftool/dbconfig/20220520-133815-root.json
[13:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff)
[13:38:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/793787 (owner: 10Marostegui)
[13:39:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793770 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[13:39:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff)
[13:41:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (10Papaul) @Volans thanks I have another server to install next week i will try and let you know.
[13:41:53] <wikibugs>	 (03PS24) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:42:06] <wikibugs>	 (03PS12) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:42:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[13:43:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[13:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:24] <wikibugs>	 (03PS1) 10Muehlenhoff: idp: Remove component/idp-test [puppet] - 10https://gerrit.wikimedia.org/r/793783 (https://phabricator.wikimedia.org/T308214)
[13:44:27] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2065.codfw.wmnet with reason: host reimage
[13:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[13:45:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[13:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T298565)', diff saved to https://phabricator.wikimedia.org/P28183 and previous config saved to /var/cache/conftool/dbconfig/20220520-134515-ladsgroup.json
[13:45:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:23] <wikibugs>	 (03PS25) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:45:23] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched on wmf wikis - https://phabricator.wikimedia.org/T298565
[13:45:38] <wikibugs>	 (03PS13) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:46:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[13:47:56] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Enable new Bullseye test IDPs in acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/793770 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[13:48:08] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2065.codfw.wmnet with reason: host reimage
[13:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:49:02] <wikibugs>	 (03PS26) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:49:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[13:50:16] <wikibugs>	 (03PS27) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:51:58] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:52:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[13:53:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 5%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28184 and previous config saved to /var/cache/conftool/dbconfig/20220520-135319-root.json
[13:53:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:53:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:53:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T303603)', diff saved to https://phabricator.wikimedia.org/P28185 and previous config saved to /var/cache/conftool/dbconfig/20220520-135350-ladsgroup.json
[13:53:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:55] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[13:54:34] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (10ssingh) Hi @Papaul: Thanks for letting us know! The host is depooled and downtimed, so please proceed whenever you want. Thanks!
[13:56:24] <wikibugs>	 (03PS28) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[13:56:54] <wikibugs>	 (03PS14) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[13:58:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[13:58:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[13:58:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:06] <wikibugs>	 (03PS1) 10Jbond: rake: spdx fix rubocop issues [puppet] - 10https://gerrit.wikimedia.org/r/793807
[14:02:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] rake: spdx fix rubocop issues [puppet] - 10https://gerrit.wikimedia.org/r/793807 (owner: 10Jbond)
[14:03:00] <wikibugs>	 (03PS3) 10Jbond: debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff)
[14:03:27] <wikibugs>	 (03CR) 10Jbond: "not sure why this was failing CI and others were not, however I have sent a fix and rebased this so it should be green now" [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff)
[14:03:35] <wikibugs>	 (03PS29) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:04:36] <wikibugs>	 (03PS30) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:05:12] <wikibugs>	 (03PS15) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:05:20] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:06:00] <wikibugs>	 (03PS16) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:06:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793783 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[14:08:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28186 and previous config saved to /var/cache/conftool/dbconfig/20220520-140823-root.json
[14:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:10] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2065.codfw.wmnet with OS bullseye
[14:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:14] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2065.codfw.wmnet with OS bullseye completed: - ms-be2065 (**PASS**)   - Downtim...
[14:09:16] <wikibugs>	 (03PS31) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:09:33] <wikibugs>	 (03PS17) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:11:32] <wikibugs>	 (03PS32) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:11:46] <wikibugs>	 (03PS18) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:12:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS bullseye
[14:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:24] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2066.codfw.wmnet with OS bullseye
[14:13:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T303603)', diff saved to https://phabricator.wikimedia.org/P28187 and previous config saved to /var/cache/conftool/dbconfig/20220520-141308-ladsgroup.json
[14:13:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[14:13:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[14:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:14] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[14:13:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T303603)', diff saved to https://phabricator.wikimedia.org/P28188 and previous config saved to /var/cache/conftool/dbconfig/20220520-141316-ladsgroup.json
[14:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:34] <wikibugs>	 (03PS33) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:13:51] <wikibugs>	 (03PS19) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:15:03] <wikibugs>	 (03PS1) 10Hnowlan: CONTRIBUTORS: add hnowlan entry [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013)
[14:15:31] <wikibugs>	 (03PS34) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:15:40] <wikibugs>	 (03PS20) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:18:42] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:20:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "super ignorant about SPDX - are we talking about the license for the Puppet module, or the ROCm suite?" [puppet] - 10https://gerrit.wikimedia.org/r/793777 (owner: 10Muehlenhoff)
[14:20:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T303603)', diff saved to https://phabricator.wikimedia.org/P28189 and previous config saved to /var/cache/conftool/dbconfig/20220520-142032-ladsgroup.json
[14:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:38] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[14:20:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013) (owner: 10Hnowlan)
[14:22:33] <wikibugs>	 (03PS9) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639)
[14:23:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28190 and previous config saved to /var/cache/conftool/dbconfig/20220520-142327-root.json
[14:23:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35457/console" [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond)
[14:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: lvs: stop double-checking catalog services from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/793815 (https://phabricator.wikimedia.org/T291946)
[14:24:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga: deprecate service::monitor class [puppet] - 10https://gerrit.wikimedia.org/r/793816 (https://phabricator.wikimedia.org/T291946)
[14:24:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Pretty great! See my small comments but LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm)
[14:24:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga: remove 'monitoring' from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946)
[14:25:14] <wikibugs>	 (03CR) 10Jcrespo: "Hello, @moritz" [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo)
[14:26:43] <wikibugs>	 (03PS1) 10Volans: sre.hosts.reimage: DHCP workaround for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/793818 (https://phabricator.wikimedia.org/T306421)
[14:28:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[14:28:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013) (owner: 10Hnowlan)
[14:29:38] <wikibugs>	 (03PS35) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:30:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond)
[14:30:19] <wikibugs>	 (03PS21) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:31:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:profile::redis::multidc: drop legacy function redis_get_instances [puppet] - 10https://gerrit.wikimedia.org/r/793111 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond)
[14:31:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[14:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:35] <wikibugs>	 (03PS36) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[14:32:57] <wikibugs>	 (03PS22) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711
[14:38:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28191 and previous config saved to /var/cache/conftool/dbconfig/20220520-143830-root.json
[14:38:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298565)', diff saved to https://phabricator.wikimedia.org/P28192 and previous config saved to /var/cache/conftool/dbconfig/20220520-144111-ladsgroup.json
[14:41:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[14:41:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[14:41:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:18] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched on wmf wikis - https://phabricator.wikimedia.org/T298565
[14:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28193 and previous config saved to /var/cache/conftool/dbconfig/20220520-144212-ladsgroup.json
[14:42:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance
[14:42:18] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[14:42:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance
[14:42:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 10 hosts with reason: Maintenance
[14:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 10 hosts with reason: Maintenance
[14:42:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/793818 (https://phabricator.wikimedia.org/T306421) (owner: 10Volans)
[14:45:41] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] sre: port mediawiki php-fpm saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[14:46:03] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2066.codfw.wmnet with OS bullseye
[14:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:07] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2066.codfw.wmnet with OS bullseye completed: - ms-be2066 (**PASS**)   - Downtim...
[14:49:37] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Looks great!  Thanks for taking the time volans :).  This may occur on cloudsw1-e4/f4 given the criteria, but I don't believe there is any" [cookbooks] - 10https://gerrit.wikimedia.org/r/793818 (https://phabricator.wikimedia.org/T306421) (owner: 10Volans)
[14:50:56] <wikibugs>	 (03PS2) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799)
[14:52:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack)
[14:53:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28194 and previous config saved to /var/cache/conftool/dbconfig/20220520-145334-root.json
[14:53:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:49] <wikibugs>	 (03CR) 10JMeybohm: Replace kubeyaml with kubeconform (if available) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm)
[14:54:46] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye
[14:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:50] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye
[14:57:47] <wikibugs>	 (03PS9) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165)
[14:57:59] <wikibugs>	 (03CR) 10JMeybohm: Replace kubeyaml with kubeconform (if available) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm)
[14:59:48] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (10Papaul) @ssingh thanks will work on it when back on site next week
[15:02:17] <wikibugs>	 (03PS1) 10Jbond: WIP: port kafka_config to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793821
[15:02:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: port kafka_config to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793821 (owner: 10Jbond)
[15:04:23] <wikibugs>	 (03PS1) 10Jbond: CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013)
[15:04:43] <wikibugs>	 (03PS1) 10Ryan Kemper: Contributors: Add ryan kemper [puppet] - 10https://gerrit.wikimedia.org/r/793823
[15:05:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: add hnowlan entry [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013) (owner: 10Hnowlan)
[15:06:02] <wikibugs>	 (03PS2) 10Jbond: CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013)
[15:06:26] <wikibugs>	 (03PS2) 10Jbond: Contributors: Add ryan kemper [puppet] - 10https://gerrit.wikimedia.org/r/793823 (owner: 10Ryan Kemper)
[15:06:30] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:06:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Contributors: Add ryan kemper [puppet] - 10https://gerrit.wikimedia.org/r/793823 (owner: 10Ryan Kemper)
[15:06:56] <wikibugs>	 (03PS3) 10Jbond: CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013)
[15:08:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28195 and previous config saved to /var/cache/conftool/dbconfig/20220520-150838-root.json
[15:08:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:54] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond)
[15:10:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond)
[15:11:26] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[15:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:48] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:14:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1118 T', diff saved to https://phabricator.wikimedia.org/P28196 and previous config saved to /var/cache/conftool/dbconfig/20220520-151407-ladsgroup.json
[15:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:23] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[15:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye
[15:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:04] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye
[15:19:17] <wikibugs>	 (03PS5) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947)
[15:19:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[15:20:44] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:20:46] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:27:07] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793520 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney)
[15:28:22] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye
[15:28:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[15:28:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[15:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:26] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye completed: - ms-be2067 (**PASS**)   - Downtim...
[15:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:48] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS bullseye
[15:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:53] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye
[15:33:22] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[15:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage
[15:36:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:43:25] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:46:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage
[15:46:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage
[15:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:44] <wikibugs>	 (CR) Hnowlan: [C: +2] image-suggestion: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[15:54:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2068.codfw.wmnet with OS bullseye
[15:54:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:31] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye completed: - ms-be2068 (**PASS**)   - Downtim...
[15:58:33] <wikibugs>	 (Merged) jenkins-bot: image-suggestion: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[15:58:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: sync
[15:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:43] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:21] <wikibugs>	 (CR) Tchanders: [C: +1] Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: Jforrester)
[16:00:49] <wikibugs>	 (CR) Tchanders: [C: +1] Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: Jforrester)
[16:02:18] <wikibugs>	 (PS2) Jbond: P:kafka: drop legacy kefka_config and kafka_config_name functions [puppet] - https://gerrit.wikimedia.org/r/793821 (https://phabricator.wikimedia.org/T308639)
[16:03:51] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS bullseye
[16:03:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:56] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye completed: - ms-be2069 (**PASS**)   - Downtim...
[16:04:36] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[16:05:10] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) All production and pre-production codfw backends done.
[16:05:53] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:08:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync
[16:08:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:25] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[16:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[16:17:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[16:17:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:14] <wikibugs>	 (PS1) Hnowlan: aqs: allow Kubernetes nodes access to cassandra [puppet] - https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891)
[16:19:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[16:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:38] <wikibugs>	 (PS1) Tchanders: Deploy IPInfo to all wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597)
[16:26:44] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:03] <wikibugs>	 (CR) Hnowlan: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35463/console" [puppet] - https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[16:31:14] <wikibugs>	 (PS1) Tchanders: Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876)
[16:31:16] <wikibugs>	 (PS1) Tchanders: Add comment to consult Legal before updating IPInfo access [mediawiki-config] - https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876)
[16:32:43] <wikibugs>	 (PS2) Hnowlan: aqs: allow Kubernetes nodes access to cassandra [puppet] - https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891)
[16:33:55] <wikibugs>	 (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35464/console" [puppet] - https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[16:35:00] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:37:14] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti5003.eqsin.wmnet with OS bullseye
[16:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:19] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti5003.eqsin.wmnet with OS bullseye
[16:38:07] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (RobH) I just overwrote the password with the exact same password, as described in the comment you linked.  It didn't fix it, so I checked the settings, and it seems perhaps the firmware load dis...
[16:41:16] <wikibugs>	 (CR) JHathaway: sre: port mx queue high page (1 comment) [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[16:55:50] <wikibugs>	 (CR) JHathaway: dumps: remove generic python 2.25.1 user agent block (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793550 (owner: JHathaway)
[16:57:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:57:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:58] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:17] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:22] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5003.eqsin.wmnet with reason: host reimage
[17:04:25] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:04:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[17:05:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[17:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:21] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:07:53] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5003.eqsin.wmnet with reason: host reimage
[17:07:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:29] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:17:13] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (MarkTraceur)
[17:27:55] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (MoritzMuehlenhoff) Thanks!
[17:28:07] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5003.eqsin.wmnet with OS bullseye
[17:28:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:11] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti5003.eqsin.wmnet with OS bullseye completed: - ganeti5003 (**WARN**)   - Downtimed on Icinga/Ale...
[17:32:10] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (RobH) a:RobH→MoritzMuehlenhoff @MoritzMuehlenhoff,  The warn is due to the host being in bios when I fired off the script, so it couldn't disable puppet on the old OS.  This host is now...
[17:32:26] <wikibugs>	 (CR) Dzahn: sre: port mx queue high page (1 comment) [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[17:35:24] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:46:40] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:46:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:51:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[17:53:16] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:53:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[17:53:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[17:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:55] <mutante>	 !log [mwmaint1002:~] $ sudo mwscript initSiteStats.php --wiki=kcgwiki --update  (to update statistics for latest wikipedia kcg) T305281
[17:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:59] <stashbot>	 T305281: Post-creation work for kcgwiki - https://phabricator.wikimedia.org/T305281
[17:56:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[17:58:07] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (cmooney) Just a brief update here.  I've completed the migration of the existing cloud realm networks configured on c...
[18:00:57] <wikibugs>	 (PS1) Cathal Mooney: Change cloudsw loopback filter to common one [homer/public] - https://gerrit.wikimedia.org/r/793855 (https://phabricator.wikimedia.org/T304989)
[18:01:14] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:01:55] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Change cloudsw loopback filter to common one [homer/public] - https://gerrit.wikimedia.org/r/793855 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[18:02:30] <wikibugs>	 (Merged) jenkins-bot: Change cloudsw loopback filter to common one [homer/public] - https://gerrit.wikimedia.org/r/793855 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[18:04:29] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:04:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:50] <icinga-wm>	 PROBLEM - SSH on ms-be1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:09:24] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:09:58] <icinga-wm>	 RECOVERY - SSH on ms-be1066 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:11:40] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:13:00] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:14:36] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:15:10] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.337 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:16:20] <wikibugs>	 (PS1) Volans: ganeti: add set_boot_media() method [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661)
[18:16:46] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.326 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:17:12] <wikibugs>	 (CR) Volans: "Adding support for https://wikitech.wikimedia.org/wiki/Ganeti#Set_boot_order_to_disk" [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661) (owner: Volans)
[18:19:45] <wikibugs>	 (PS3) BBlack: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799)
[18:23:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] ganeti: add set_boot_media() method [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661) (owner: Volans)
[18:34:42] <wikibugs>	 (CR) Dwisehaupt: [C: +1] "Ha, yes. Fix the typo." [dns] - https://gerrit.wikimedia.org/r/793725 (https://phabricator.wikimedia.org/T308672) (owner: Volans)
[18:36:30] <wikibugs>	 (CR) Dwisehaupt: [C: +1] "This looks correct and appropriate to me" [dns] - https://gerrit.wikimedia.org/r/793726 (https://phabricator.wikimedia.org/T308672) (owner: Volans)
[18:41:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[18:41:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[18:41:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:52] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:52:10] <wikibugs>	 (PS1) Zabe: ulogd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793858 (https://phabricator.wikimedia.org/T308013)
[18:52:12] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-243916.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:54:48] <wikibugs>	 (PS1) Zabe: udev: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793859 (https://phabricator.wikimedia.org/T308013)
[18:57:14] <wikibugs>	 (PS1) Zabe: trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793860 (https://phabricator.wikimedia.org/T308013)
[19:00:26] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The following units failed: session-338037.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:08] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:04:17] <wikibugs>	 (CR) Ori: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: Ori)
[19:04:34] <wikibugs>	 (PS3) Ori: New service: function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698)
[19:06:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[19:06:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[19:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28198 and previous config saved to /var/cache/conftool/dbconfig/20220520-190633-ladsgroup.json
[19:06:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:40] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[19:29:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:45:19] <wikibugs>	 (CR) Dzahn: [C: +2] "being bold, just merging it and testing it on gitlab1004 .. since this is also a bit time constrained with the delays to get stuff into pr" [puppet] - https://gerrit.wikimedia.org/r/793534 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[19:49:20] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:01:26] <wikibugs>	 (PS1) Ladsgroup: Make IS.php return an array [mediawiki-config] - https://gerrit.wikimedia.org/r/793869
[20:02:45] <wikibugs>	 (PS1) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[20:05:14] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:13:16] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:13:34] <wikibugs>	 (PS2) Krinkle: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821)
[20:15:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[20:15:12] <wikibugs>	 (PS1) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:16:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:19:47] <wikibugs>	 (PS1) Dduvall: WIP docker_registry_ha: Support GitLab JSON Web Token auth [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501)
[20:20:15] <wikibugs>	 (PS3) Krinkle: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821)
[20:20:17] <wikibugs>	 (PS2) Krinkle: MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821)
[20:20:40] <wikibugs>	 (PS2) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[20:20:42] <wikibugs>	 (PS2) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:22:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[20:22:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[20:22:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:29:19] <wikibugs>	 (CR) Krinkle: Move out ORES extension configuration out of InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:30:16] <wikibugs>	 (PS3) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:30:21] <wikibugs>	 (CR) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:31:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:34:06] <wikibugs>	 (PS4) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:36:04] <wikibugs>	 (PS4) Krinkle: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821)
[20:36:06] <wikibugs>	 (PS3) Krinkle: MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821)
[20:42:52] <wikibugs>	 SRE, MediaWiki-extensions-CodeReview, Platform Engineering, serviceops-radar, Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (Legoktm) >>! In T205361#7815573, @Krinkle wrote: >>>! In T205361#7815521, @...
[20:50:14] <wikibugs>	 (PS2) Volans: ganeti: add set_boot_media() method [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661)
[21:00:23] <wikibugs>	 SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (bking) Hey Volans, sorry I didn't get to this by end of week as promised; I was sick on Weds and Thurs. Starting Monday, some...
[21:10:14] <wikibugs>	 (PS3) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:10:16] <wikibugs>	 (PS5) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:12:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:12:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:13:41] <wikibugs>	 (PS4) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:13:43] <wikibugs>	 (PS6) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:15:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:15:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:19:37] <wikibugs>	 (PS5) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:19:39] <wikibugs>	 (PS7) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:21:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:21:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:22:49] <wikibugs>	 (PS8) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:23:58] <wikibugs>	 (PS6) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:24:00] <wikibugs>	 (PS9) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:24:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:25:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:25:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:26:36] <wikibugs>	 (PS7) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:26:38] <wikibugs>	 (PS10) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:27:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:31:14] <wikibugs>	 (PS11) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:33:56] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab1004.wikimedia.org with reason: reimage
[21:33:58] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab1004.wikimedia.org with reason: reimage
[21:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:38] <mutante>	 !log reimaging gitlab1004 (insetup) to test partman recipe from gerrit:793534 - T307142
[21:34:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:42] <stashbot>	 T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142
[21:36:05] <mutante>	 !log attempt to use reimage cookbook failed: spicerack.netbox.NetboxHostNotFoundError
[21:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:10] <mutante>	 !log attempt to use reimage cookbook failed: spicerack.netbox.NetboxHostNotFoundError T307142
[21:36:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:20] <mutante>	 !log correction: mistake was to use FQDN T307142
[21:37:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:26] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bullseye
[21:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:20] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage
[21:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:54:50] <wikibugs>	 (CR) Krinkle: Make CommonSettings load the array from IS.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:55:13] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage
[21:55:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28201 and previous config saved to /var/cache/conftool/dbconfig/20220520-215514-ladsgroup.json
[21:55:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:55:16] <wikibugs>	 (CR) Krinkle: Make CommonSettings load the array from IS.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:55:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:23] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[21:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28202 and previous config saved to /var/cache/conftool/dbconfig/20220520-220046-ladsgroup.json
[22:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:46] <wikibugs>	 (PS8) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[22:02:49] <wikibugs>	 (PS12) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[22:03:37] <wikibugs>	 (CR) Ladsgroup: Make CommonSettings load the array from IS.php (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[22:06:52] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1004.wikimedia.org with OS bullseye
[22:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:36] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 877698 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[22:08:00] <wikibugs>	 SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (Volans) Sounds good to me. Thanks for the update :)
[22:15:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28203 and previous config saved to /var/cache/conftool/dbconfig/20220520-221550-ladsgroup.json
[22:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[22:24:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[22:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:31] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dmantena) Thanks for this information! Frankly, I'd prefer //not// to have production shell access and these elevated permissions. I'm just after a snapshot of the iOS notifications event dashboar...
[22:30:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28204 and previous config saved to /var/cache/conftool/dbconfig/20220520-223054-ladsgroup.json
[22:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:29] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:33:24] <wikibugs>	 SRE, Analytics, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn) I think we should escalate this directly to the analytics team for advice how to move forward. Let me add them.
[22:33:28] <wikibugs>	 SRE, Analytics, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn)
[22:43:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: session-337818.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28205 and previous config saved to /var/cache/conftool/dbconfig/20220520-224558-ladsgroup.json
[22:46:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:07] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:16:21] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:22:00] <wikibugs>	 (PS1) Stang: rowiki: Use Romanian canonical name [mediawiki-config] - https://gerrit.wikimedia.org/r/793999 (https://phabricator.wikimedia.org/T127607)
[23:32:30] <wikibugs>	 (PS1) Stang: Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - https://gerrit.wikimedia.org/r/794000 (https://phabricator.wikimedia.org/T308702)
[23:42:32] <wikibugs>	 (PS19) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:42:59] <wikibugs>	 (PS20) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:43:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang)
[23:45:18] <wikibugs>	 (PS21) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:46:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang)
[23:53:52] <wikibugs>	 (PS22) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:54:20] <wikibugs>	 (PS23) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)