[00:01:20] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:03:35] (CR) Dzahn: [V: +1 C: +2] delete expired certs etcd.eqiad.wmnet.crt and etcd.codfw.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) (owner: Dzahn)
[00:05:38] PROBLEM - purged service on cp3060 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:11:34] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-clean-fairscheduler-event-logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:21] (CR) Dzahn: [V: +1 C: +2] "noop on conf* and elsewhere like kubetcd*" [puppet] - https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) (owner: Dzahn)
[00:22:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298560)', diff saved to https://phabricator.wikimedia.org/P28159 and previous config saved to /var/cache/conftool/dbconfig/20220520-002227-ladsgroup.json
[00:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:34] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[00:23:57] (CR) Dzahn: "I had similar ones back in 2017 but eventually gave up.
https://gerrit.wikimedia.org/r/q/topic:expired-certs" [puppet] - https://gerrit.wikimedia.org/r/791673 (owner: Dzahn)
[00:27:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host netmon1003.wikimedia.org with OS bullseye
[00:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:03] SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host netmon1003.wikimedia.org with OS bullseye
[00:29:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon1003.wikimedia.org with reason: host reimage
[00:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:33:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon1003.wikimedia.org with reason: host reimage
[00:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28160 and previous config saved to /var/cache/conftool/dbconfig/20220520-003732-ladsgroup.json
[00:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:14] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:40:39] (CR) Dzahn: "I am not very familiar with the partman syntax but the easiest way forward here is to just try it and reimage gitlab1004
which is still "i" [puppet] - https://gerrit.wikimedia.org/r/793534 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[00:44:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon1003.wikimedia.org with OS bullseye
[00:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:01] SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host netmon1003.wikimedia.org with OS bullseye completed: - netmon1003 (**WARN**) - Downtimed...
[00:52:29] SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (Papaul)
[00:52:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28161 and previous config saved to /var/cache/conftool/dbconfig/20220520-005237-ladsgroup.json
[00:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:45] SRE, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (Papaul) a:Papaul→Jclark-ctr @Jclark-ctr complete
[01:07:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298560)', diff saved to https://phabricator.wikimedia.org/P28162 and previous config saved to /var/cache/conftool/dbconfig/20220520-010743-ladsgroup.json
[01:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:48] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:09:18] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[01:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:19] (PS1) Cathal Mooney:
Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989)
[01:23:56] (CR) jerkins-bot: [V: -1] Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[01:26:03] (PS2) Cathal Mooney: Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989)
[01:27:37] SRE, ops-codfw, Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (ssingh)
[01:28:50] (CR) Cathal Mooney: [C: +2] Automtation changes to match cloudsw config after migration. [homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[01:29:26] (Merged) jenkins-bot: Automtation changes to match cloudsw config after migration.
[homer/public] - https://gerrit.wikimedia.org/r/793568 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[01:31:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:08] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:57] (CR) Dzahn: sre: port mx queue high page (1 comment) [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[01:40:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:10] PROBLEM - Disk space on gitlab1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1003&var-datasource=eqiad+prometheus/ops
[02:05:14] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.)
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[02:07:24] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:36:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:46:02] (PS2) KartikMistry: Enable ContentTranslation as default for cs, el, he, ko and tr WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/793444 (https://phabricator.wikimedia.org/T298239)
[05:04:51] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:09:30] !log dbmaint s1@eqiad T298554
[05:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:37] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[05:12:15] SRE, Infrastructure-Foundations, netops: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (Marostegui) p:Triage→Medium a:Papaul
[05:13:33] (CR) Muehlenhoff: "Looks good, merging."
[puppet] - https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:13:35] (CR) Muehlenhoff: [C: +2] vagrant: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[05:14:47] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:18:26] (CR) Muehlenhoff: [C: +2] striker: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793397 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:18:33] (PS2) Muehlenhoff: striker: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793397 (https://phabricator.wikimedia.org/T308013)
[05:22:49] (PS2) Muehlenhoff: dnsdist: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013)
[05:25:13] (CR) Muehlenhoff: [C: +2] dnsdist: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:26:48] (PS2) Muehlenhoff: gitlab/gitlab_runner: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013)
[05:31:02] (PS3) Muehlenhoff: bird/fastnetmon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013)
[05:33:53] (PS1) Marostegui: Revert "db2092: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/793587
[05:34:35] (CR) Marostegui: [C: +2] Revert "db2092: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/793587 (owner: Marostegui)
[05:35:34] (CR) Muehlenhoff: [C: +2] bird/fastnetmon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:39:41] PROBLEM - SSH on
wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:03:04] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Dzahn) >>! In T306654#7942789, @jbond wrote: > the work flow that requires puppet-merge @Jclark-ctr Correct me if I'm wrong but this is mostly abou...
[06:03:19] !log racadm racreset on ganeti5003
[06:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:37] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:15:09] SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH @RobH I'm unable to reimage ganeti5003, the ipmitool call fails with "Error: Unable to establish IPMI v2 / RMCP+ session" I've tried a racres...
[06:15:51] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: an-tool1005, cloudservices1003, cloudservices1004, ms-be1068, ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, releases1002, releases2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wik
[06:15:51] rg/wiki/Puppet%23check_puppet_run_changes
[06:18:21] SRE, Beta-Cluster-Infrastructure, Traffic: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (Marostegui) Open→Resolved Closing this for now. Reopen if needed. Thanks for reporting!
[06:19:21] (PS1) Muehlenhoff: Switch idp-test2002 to idp_test role [puppet] - https://gerrit.wikimedia.org/r/793634 (https://phabricator.wikimedia.org/T308214)
[06:19:37] SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (Marostegui) @MoritzMuehlenhoff good to close?
[06:19:46] (CR) Muehlenhoff: [C: +2] Enable component/ganeti3 for the esams cluster [puppet] - https://gerrit.wikimedia.org/r/793491 (https://phabricator.wikimedia.org/T308238) (owner: Muehlenhoff)
[06:20:13] (PS2) Muehlenhoff: Switch idp-test2002 to idp_test role [puppet] - https://gerrit.wikimedia.org/r/793634 (https://phabricator.wikimedia.org/T308214)
[06:34:35] (PS1) Marostegui: db1118: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/793635
[06:36:38] (CR) Marostegui: [C: +2] db1118: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/793635 (owner: Marostegui)
[06:36:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P28164 and previous config saved to /var/cache/conftool/dbconfig/20220520-063656-root.json
[06:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:49] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:33] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:46:10] (CR) Slyngshede: [C: +2] WIP: Trial implementation of a private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: Slyngshede)
[06:48:52] SRE, SRE-Access-Requests, Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (elukey) @thcipriani Hi! When you have a moment, could you please review this request and let me know if it is a goo...
[06:52:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P28166 and previous config saved to /var/cache/conftool/dbconfig/20220520-065200-root.json
[06:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:44] (CR) Alexandros Kosiaris: [C: +1] "Technically LGTM, however this might have some repercussions. There is some minor puppet stuff (mostly mtail tests, so nothing breaking) t" [debs/pybal] - https://gerrit.wikimedia.org/r/743222 (owner: Ebernhardson)
[06:55:55] (CR) Giuseppe Lavagetto: mediawiki::system_users: add mwpresync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: Giuseppe Lavagetto)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220520T0700)
[07:01:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:03:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:06:26] (PS1) Slyngshede: WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704
[07:07:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P28167 and previous config saved to /var/cache/conftool/dbconfig/20220520-070704-root.json
[07:07:08] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:01] (PS2) Slyngshede: WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704
[07:09:49] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:08] (CR) Muehlenhoff: WIP: Private APT repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:10:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:20] (PS3) Slyngshede: WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704
[07:11:54] (CR) Slyngshede: WIP: Private APT repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:14:57] (Abandoned) Jcrespo: Revert "dumps: Block python requests UA" [puppet] - https://gerrit.wikimedia.org/r/784715 (owner: Jcrespo)
[07:15:30] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:15:50] (CR) Slyngshede: [C: +2] WIP: Private APT repo [puppet] - https://gerrit.wikimedia.org/r/793704 (owner: Slyngshede)
[07:18:41] (CR) Muehlenhoff: [C: +1] "Looks great, ship it" [debs/kubeconform] - https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[07:22:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P28168 and previous config saved to /var/cache/conftool/dbconfig/20220520-072208-root.json
[07:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:50] SRE,
Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff) That is still work in progress
[07:25:19] (CR) Muehlenhoff: [C: +2] Switch idp-test2002 to idp_test role [puppet] - https://gerrit.wikimedia.org/r/793634 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[07:33:17] (CR) JMeybohm: [C: +2] Add debian directory [debs/kubeconform] - https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[07:35:11] (PS1) Elukey: profile::cassandra::single_instance: add target_version [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232)
[07:35:32] (PS1) Muehlenhoff: profile::idp::build: Add missing package required for CAS build [puppet] - https://gerrit.wikimedia.org/r/793708
[07:35:54] (Merged) jenkins-bot: Add debian directory [debs/kubeconform] - https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[07:36:26] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35429/console" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[07:36:45] (PS2) Muehlenhoff: profile::idp::build: Add missing package required for CAS build [puppet] - https://gerrit.wikimedia.org/r/793708
[07:37:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P28169 and previous config saved to /var/cache/conftool/dbconfig/20220520-073712-root.json
[07:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:50] (CR) Elukey: [V: +1] "The profile::cassandra::single_instance seems used only in deployment-prep afaics, so maybe it is not really used at all.
Asking around be" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[07:40:31] (CR) Muehlenhoff: [C: +2] profile::idp::build: Add missing package required for CAS build [puppet] - https://gerrit.wikimedia.org/r/793708 (owner: Muehlenhoff)
[07:48:02] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff)
[07:50:31] (PS1) Slyngshede: WIP: Private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793710
[07:51:07] (CR) jerkins-bot: [V: -1] WIP: Private APT repo. [puppet] - https://gerrit.wikimedia.org/r/793710 (owner: Slyngshede)
[07:52:03] !log imported kubeconform 0.4.13-1 to buster-,bullseye-wikimedia - T306165
[07:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:09] T306165: Replace kubeyaml in deployment-charts CI - https://phabricator.wikimedia.org/T306165
[07:52:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P28170 and previous config saved to /var/cache/conftool/dbconfig/20220520-075215-root.json
[07:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:03] (PS2) Slyngshede: WIP: Private APT repo.
[puppet] - https://gerrit.wikimedia.org/r/793710
[07:53:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2062.codfw.wmnet with OS bullseye
[07:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:15] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2062.codfw.wmnet with OS bullseye
[07:54:04] (PS2) Elukey: profile::cassandra::single_instance: add target_version and rack [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232)
[07:54:06] (PS1) Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - https://gerrit.wikimedia.org/r/793711
[07:55:03] (CR) jerkins-bot: [V: -1] [WIP]Move debmonitor to the new django profile format [puppet] - https://gerrit.wikimedia.org/r/793711 (owner: Jcrespo)
[07:55:18] (PS3) Elukey: profile::cassandra::single_instance: add target_version and rack [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232)
[07:56:18] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35430/console" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:06:58] PROBLEM - Check no envoy runtime configuration is left persistent on idp-test2002 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:07:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P28171 and previous config saved to /var/cache/conftool/dbconfig/20220520-080719-root.json
[08:07:22] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2062.codfw.wmnet with reason: host reimage
[08:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:22] PROBLEM - Check that envoy is running on idp-test2002 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:12:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2062.codfw.wmnet with reason: host reimage
[08:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:18] (PS1) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[08:21:28] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35431/console" [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:22:32] (PS2) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[08:38:40] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/793707 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:42:28] (CR) Filippo Giunchedi: "Thanks for the feedback folks! I'll merge early next week" [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:43:08] (CR) Filippo Giunchedi: [C: +1] "LGTM!
(untested)" [puppet] - https://gerrit.wikimedia.org/r/793094 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[08:44:45] (CR) JMeybohm: [C: +1] "recheck (for https://gerrit.wikimedia.org/r/c/integration/config/+/793506)" [deployment-charts] - https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[08:44:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2062.codfw.wmnet with OS bullseye
[08:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:52] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2062.codfw.wmnet with OS bullseye completed: - ms-be2062 (**PASS**) - Downtim...
[08:45:05] (CR) Jcrespo: alerting_host: Remove references to dbbackups monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[08:45:37] (CR) JMeybohm: "recheck (expected to fail, missing local schema)" [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[08:46:00] PROBLEM - Memcached on idp-test2002 is CRITICAL: connect to address 208.80.153.70 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[08:46:44] (CR) Jcrespo: [C: +2] alert_host: Ensure packages and files from dbbackups check are gone [puppet] - https://gerrit.wikimedia.org/r/793094 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[08:46:48] (CR) jerkins-bot: [V: -1] Replace kubeyaml with kubeconform (if available) [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: JMeybohm)
[08:48:17] (PS1) Elukey: Add fake secret for
the new ML Cassandra cluster [labs/private] - https://gerrit.wikimedia.org/r/793717
[08:48:32] (CR) Btullis: "I've now generated the keys in the private repo, so this should be unblocked." [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[08:48:40] (CR) Elukey: [V: +2 C: +2] Add fake secret for the new ML Cassandra cluster [labs/private] - https://gerrit.wikimedia.org/r/793717 (owner: Elukey)
[08:51:45] (PS3) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[08:51:48] (CR) Btullis: [C: +1] "This looks good to me. Should I +2 and deploy now?" [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[08:51:52] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:52:05] (PS2) Jcrespo: ddbackups: Remove old references to the check pass on the alert hosts [labs/private] - https://gerrit.wikimedia.org/r/793042 (https://phabricator.wikimedia.org/T283017)
[08:52:16] (PS2) Btullis: Enable cassandra encryption (aqs cluster) [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[08:52:38] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35434/console" [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:53:58] !log re-enabling puppet and repooling cp3060 - T308797 T243167
[08:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:05] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167
[08:54:06] T308797: cp3060 idrac
https interface failures - https://phabricator.wikimedia.org/T308797
[08:55:58] RECOVERY - purged service on cp3060 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:57:07] SRE, ops-esams, DC-Ops: cp3060 idrac https interface failures - https://phabricator.wikimedia.org/T308797 (Vgutierrez) @RobH feel free to depool / disable-puppet again on cp3060 when you need to work on it, meanwhile I'm letting cp3060 handle some traffic in esams :)
[09:00:11] (CR) Jcrespo: [V: +2 C: +2] ddbackups: Remove old references to the check pass on the alert hosts [labs/private] - https://gerrit.wikimedia.org/r/793042 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[09:07:20] (PS18) Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017)
[09:11:43] (PS1) Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847)
[09:13:03] (PS1) Volans: wikimedia-dns: add zone validator ignore comments [dns] - https://gerrit.wikimedia.org/r/793724 (https://phabricator.wikimedia.org/T155761)
[09:13:05] (PS1) Volans: fr-tech: fix typo in PTR record [dns] - https://gerrit.wikimedia.org/r/793725 (https://phabricator.wikimedia.org/T308672)
[09:13:07] (PS1) Volans: fr-tech: add zone validator ignore comments [dns] - https://gerrit.wikimedia.org/r/793726 (https://phabricator.wikimedia.org/T308672)
[09:13:09] (PS1) Volans: Non-WMF IPs: add zone validator ignore comments [dns] - https://gerrit.wikimedia.org/r/793727 (https://phabricator.wikimedia.org/T155761)
[09:13:11] (PS1) Volans: Duplicate IPs by design: add zone validator ignore [dns] - https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761)
[09:13:13] (PS1) Volans: wikitech-static-iad: remove obsolete
records [dns] - 10https://gerrit.wikimedia.org/r/793729 (https://phabricator.wikimedia.org/T155761) [09:14:03] (03PS4) 10Filippo Giunchedi: mediawiki: remove idle php-fpm workers alert, moved to prometheus/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/791360 (https://phabricator.wikimedia.org/T305847) [09:14:05] (03PS4) 10Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) [09:14:07] (03PS1) 10Filippo Giunchedi: fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) [09:14:09] (03CR) 10Jcrespo: "Not yet happy with the hiera parameters, so this is not deploy-ready, plus not sure if some of the change are bug fixes or intentional, bu" [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [09:15:06] 10Puppet, 10Infrastructure-Foundations: puppet admin: cxheck if additional gropus in systenmd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (10jbond) p:05Triage→03Medium [09:16:20] (03PS2) 10Volans: Duplicate names by design: add zone validator ignore [dns] - 10https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761) [09:16:30] (03CR) 10Jbond: [C: 03+1] mediawiki::system_users: add mwpresync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [09:17:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2063.codfw.wmnet with OS bullseye [09:17:48] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2063.codfw.wmnet with OS bullseye [09:17:50] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:56] (03CR) 10Jbond: "CR LGTM, is there a phab task to link to?" [puppet] - 10https://gerrit.wikimedia.org/r/793550 (owner: 10JHathaway) [09:19:53] (03PS5) 10Jcrespo: alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) [09:24:38] (03CR) 10Volans: "I've also quickly checked on the rackspace portal and didn't find any reference to wikitech-static-iad" [dns] - 10https://gerrit.wikimedia.org/r/793729 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [09:33:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2063.codfw.wmnet with reason: host reimage [09:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:19] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2063.codfw.wmnet with reason: host reimage [09:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:19] (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:19] (ProbeDown) firing: (21) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:34] * volans here [09:35:41] * jelto here [09:35:45] * volans acked on VO [09:35:55] here too [09:36:07] * jbond here [09:36:10] PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:36:13] On phone [09:36:19] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.002481 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [09:36:22] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:36:23] <_joe_> it's the database [09:36:24] PROBLEM - Apache HTTP on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:36:40] wow it wasn't just me ... [09:36:40] PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:36:40] Running back [09:36:40] PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:36:43] <_joe_> volans: take a look at the slow queries is my suggestion [09:36:46] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:36:48] yeah I'm looking [09:36:49] (03CR) 10Jcrespo: [C: 03+2] alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [09:36:50] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:00] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:00] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:08] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:10] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:12] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:18] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:20] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:21] db1172 comes quite often [09:37:24] <_joe_> marostegui: we might need help [09:37:30] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:37:33] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1636 bytes in 1.474 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:37:36] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:36] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [09:37:39] PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:37:40] PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:37:42] PROBLEM - Apache HTTP on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:46] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:49] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:37:53] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:37:53] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:37:53] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1326.eqiad.wmnet, mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1371.eqiad.wmnet, mw1455.eqiad.wmnet, mw1453.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1322.eqiad.wmnet, mw1432.eqiad.wmnet, 
mw1323.eqiad.wmnet, mw1384.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1407.eqiad.wmnet, mw [09:37:53] ad.wmnet, mw1351.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1352.eqiad.wmnet, mw1399.eqiad.wmnet, mw1368.eqiad.wmnet, mw1435.eqiad.wmnet, mw1454.eqiad.wmnet, mw1431.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1411.eqiad.wmnet, mw1354.eqiad.wmnet, mw1366.eqiad.wmnet, mw1324.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1397.eqiad.wmnet, mw1319.eqiad.wmnet, mw1389.eqiad.wmnet, mw1418.eqiad [09:37:53] mw1321.eqiad.wmnet, mw1395.eqiad.wmnet, mw1403.eqiad.wmnet, mw1325.eqiad.wmnet, mw1409.eqiad.wmnet, mw1385.eqiad.wmnet, mw1436.eqiad.wmnet, mw1417.eqiad.wmnet, mw1367.eqiad.wmnet, mw144 https://wikitech.wikimedia.org/wiki/PyBal [09:38:06] what's up? [09:38:07] Marostegui ^ jynus [09:38:11] is it db1172? [09:38:12] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9722 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [09:38:13] db1172 could you have a look?
[09:38:14] PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:14] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:18] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:18] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1433.eqiad.wmnet, mw1365.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1432.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, mw1430.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw [09:38:18] ad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1403.eqiad.wmnet, mw1389.eqiad.wmnet, mw1395.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1385.eqiad.wmnet, mw1417.eqiad.wmnet, mw1455.eqiad.wmnet, mw1373.eqiad.wmnet, mw1326.eqiad.wmnet, mw1332.eqiad.wmnet, mw1452.eqiad.wmnet, mw1367.eqiad.wmnet, mw1414.eqiad.wmnet, mw1369.eqiad.wmnet, mw1371.eqiad.wmnet, mw1453.eqiad.wmnet, mw1322.eqiad [09:38:18] mw1319.eqiad.wmnet, mw1323.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1351.eqiad.wmnet, mw1391.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw141 https://wikitech.wikimedia.org/wiki/PyBal [09:38:18] sure [09:38:21] PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1625 bytes in 0.706 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:38:21] PROBLEM - Apache HTTP on 
mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:21] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [09:38:22] PROBLEM - Apache HTTP on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:22] PROBLEM - Apache HTTP on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:22] PROBLEM - Apache HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:22] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:22] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:24] PROBLEM - Apache HTTP on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:38:29] PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:38:33] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:38:34] RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:38:36] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:38:37] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [09:38:40] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:38:46] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [09:38:50] RECOVERY - Apache HTTP on mw1419 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.743 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:39:03] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on 
text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:39:04] PROBLEM - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:39:04] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:39:06] PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:39:07] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [09:39:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 167 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:39:28] !log volans@cumin1001 dbctl commit (dc=all): 'emergency depool', diff saved to https://phabricator.wikimedia.org/P28172 and previous config saved to /var/cache/conftool/dbconfig/20220520-093928-volans.json [09:39:34] I am going to depool it for now [09:39:38] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.965 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:39:40] PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:39:40] marostegui: already done
[09:39:42] PROBLEM - PHP7 rendering on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:39:59] volans@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [09:40:05] RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18861 bytes in 7.722 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:40:06] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18885 bytes in 7.260 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:40:08] PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:08] RECOVERY - Apache HTTP on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.375 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:40:09] coordination is happening in the other channel [09:40:16] too much noise here [09:40:17] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18860 bytes in 8.256 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:40:18] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:18] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.650 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [09:40:20] PROBLEM - PHP7 rendering on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:24] <_joe_> ok and we're magically back [09:40:30] PROBLEM - PHP7 rendering on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:36] PROBLEM - PHP7 rendering on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:38] PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:40] <_joe_> uhm no [09:40:44] PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:44] PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:40:46] RECOVERY - Apache HTTP on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.438 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:40:46] RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.489 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:40:46] RECOVERY - Apache HTTP on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.183 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:40:48] RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:40:50] RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: 
HTTP/1.1 302 Found - 547 bytes in 9.921 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:40:56] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:41:01] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18861 bytes in 9.082 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:41:01] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:41:04] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:41:04] PROBLEM - PHP7 rendering on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:41:13] <_joe_> I'm not sure we're out of the woods [09:41:20] PROBLEM - PHP7 rendering on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:41:29] RECOVERY - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 17748 bytes in 7.332 second response time 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:41:30] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.413 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:41:31] RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18861 bytes in 5.797 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:41:31] RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.439 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:41:40] RECOVERY - Apache HTTP on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.901 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:41:46] PROBLEM - PHP7 rendering on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:41:48] PROBLEM - PHP7 rendering on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:41:54] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.669 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:42:01] looks like moved to db1131 [09:42:04] PROBLEM - PHP7 rendering on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:42:06] PROBLEM - PHP7 rendering on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:42:20] PROBLEM - PHP7 rendering on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:42:26] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:29] RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 0.840 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:42:45] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 4.661 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:43:16] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [09:43:21] RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 5.734 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:43:28] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [09:43:54] RECOVERY - Apache HTTP on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.795 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:44:00] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 48 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:44:28] RECOVERY - PHP7 rendering on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.152 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:45:14] RECOVERY - PHP7 rendering on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.946 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:45:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:45:19] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:45:28] RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 2.377 second response time 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:45:45] (JobUnavailable) firing: (3) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:52] RECOVERY - PHP7 rendering on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 7.729 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:02] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmne [09:46:02] 8.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:46:12] RECOVERY - PHP7 rendering on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 6.490 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:14] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.355 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:14] RECOVERY - PHP7 rendering on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.391 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:16] RECOVERY - Apache HTTP on mw1350 is OK: HTTP 
OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:22] RECOVERY - PHP7 rendering on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 1.564 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:22] RECOVERY - PHP7 rendering on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:24] RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.507 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:26] RECOVERY - PHP7 rendering on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.731 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:26] RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:28] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:46:30] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5008.eqsin.wmnet, cp5007.eqsin.wmnet are marked down bu [09:46:30] : textlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5008.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:46:32] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 546 
bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:32] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:34] RECOVERY - PHP7 rendering on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:48] RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:48] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:52] RECOVERY - PHP7 rendering on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:56] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:46:58] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:46:58] RECOVERY - PHP7 rendering on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:47:08] RECOVERY - PHP7 rendering on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:47:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:47:16] RECOVERY - PHP7 
rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:47:20] RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:47:20] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:47:22] RECOVERY - PHP7 rendering on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:47:22] RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:47:26] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:47:28] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:47:30] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:47:34] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:47:42] RECOVERY - PHP7 rendering on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:47:42] RECOVERY - PHP7 rendering on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.051 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:47:46] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 3.661 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:47:48] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Bawolff) [09:47:54] RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:47:55] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.634 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [09:47:58] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [09:48:09] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 2.065 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:48:10] RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:48:16] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:48:44] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:48:52] PROBLEM - 
Time elapsed since the last kafka event processed by purged on cp5003 is CRITICAL: 5.145e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [09:49:04] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:49:04] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 3.21e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [09:49:32] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 3.301e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [09:49:40] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [09:50:02] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:50:18] (ProbeDown) resolved: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:50:18] (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:50:45] (JobUnavailable) firing: (3) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:00] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5003 is OK: (C)5000 gt (W)3000 gt 649.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [09:51:14] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 197.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [09:51:38] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 223.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [10:01:10] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:01:34] SRE, Wikimedia-Incident: Very long loading and crash with an error across all Wikimedia sites - https://phabricator.wikimedia.org/T308827 (RhinosF1) p:Unbreak!→Lowest Dupe of mentioned task [10:02:07] SRE, Wikimedia-Incident: Very long loading and crash with an error across all Wikimedia sites - https://phabricator.wikimedia.org/T308827 (RhinosF1) [10:04:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2063.codfw.wmnet with OS bullseye [10:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:46] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2063.codfw.wmnet with OS bullseye completed: - ms-be2063 (**WARN**) - Downtim... 
[10:06:18] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:15] (PS1) Ladsgroup: Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737 [10:08:44] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:08:53] (PS2) Ladsgroup: Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737 [10:14:25] (CR) Ladsgroup: [C: +2] Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737 (owner: Ladsgroup) [10:15:12] (Merged) jenkins-bot: Revert read new on frwiki for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/793737 (owner: Ladsgroup) [10:17:04] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793737|Revert read new on frwiki for templatelinks migration]] (duration: 00m 51s) [10:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:19:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:25:46] (PS1) Jbond: CONTRIBUTORS: add Brian Wolff [puppet] - https://gerrit.wikimedia.org/r/793740 (https://phabricator.wikimedia.org/T308013) [10:26:28] (CR) Jbond: [C: +2] CONTRIBUTORS: add Brian Wolff [puppet] - https://gerrit.wikimedia.org/r/793740 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [10:34:20] (CR) Hnowlan: [C: +2] Add helmfile configuration for image-suggestion [deployment-charts] - https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan) [10:36:57] * TheresNoTime looks at scrollback [10:37:10] and a *good morning* to you too icinga-wm o.O [10:39:50] (Merged) jenkins-bot: Add helmfile configuration for image-suggestion [deployment-charts] - https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan) [10:42:15] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:42:39] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: sync [10:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:38] (CR) Muehlenhoff: [C: +1] "For code aimed for production using systemd::sysuser is nowadays the preferred way, but since this is for some initial limited to Cloud VP" [puppet] - https://gerrit.wikimedia.org/r/793710 (owner: Slyngshede) [10:51:27] (CR) Giuseppe Lavagetto: "I would bump this up to two minutes, otherwise lgtm." 
[alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [10:52:42] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync [10:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:03] !log uploaded cas 6.4.6.3-wmf11u1 to apt.wikimedia.org/bullseye [10:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:11] (PS3) Filippo Giunchedi: sre: port mediawiki php-fpm saturation alert [alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) [10:56:13] (PS2) Filippo Giunchedi: sre: port mx queue high page [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) [10:56:15] (PS2) Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) [10:56:17] (CR) Filippo Giunchedi: sre: port mediawiki php-fpm saturation alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [10:58:19] (PS1) Muehlenhoff: Only add component/memcached16 on Buster [puppet] - https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) [10:59:47] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff) [11:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:09:39] !log drop backupcheck users from m1>dbbackups [11:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:48] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:10:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:10:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 8:00:00 on 8 hosts with reason: Maintenance [11:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 8:00:00 on 8 hosts with reason: Maintenance [11:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:12:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298555)', diff saved to https://phabricator.wikimedia.org/P28173 and previous config saved to /var/cache/conftool/dbconfig/20220520-111239-ladsgroup.json [11:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:47] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [11:14:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:15:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:05] (PS1) Hnowlan: image-suggestion: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891) [11:18:47] (PS1) Jbond: rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) [11:20:03] (CR) jerkins-bot: [V: -1] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [11:20:30] (PS2) Jbond: rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) [11:20:51] (CR) Jbond: [C: +2] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [11:21:44] (CR) jerkins-bot: [V: -1] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [11:21:52] (CR) Slyngshede: [C: +2] WIP: Private APT repo. 
[puppet] - https://gerrit.wikimedia.org/r/793710 (owner: Slyngshede) [11:22:28] (PS3) Jbond: rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) [11:22:43] (CR) Jbond: [C: +2] rake - spdx: update conversion job to detect contributors [puppet] - https://gerrit.wikimedia.org/r/793748 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [11:24:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [11:24:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [11:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T303603)', diff saved to https://phabricator.wikimedia.org/P28174 and previous config saved to /var/cache/conftool/dbconfig/20220520-112449-ladsgroup.json [11:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:56] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:29:16] (PS1) Muehlenhoff: Also add component/idp-test for Bullseye [puppet] - https://gerrit.wikimedia.org/r/793751 (https://phabricator.wikimedia.org/T308214) [11:30:07] slyngs: happy for me to merge your CR [11:30:27] Yes, it's just for testing [11:31:05] But I was wondering where it went :-) [11:31:06] ack on sec im going to quickly send another one with theses [11:31:07] (CR) Muehlenhoff: [C: +2] Also add component/idp-test for Bullseye [puppet] - https://gerrit.wikimedia.org/r/793751 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff) [11:31:59] jbond: you can also merge along my change, then [11:32:07] ack will 
do thanks [11:32:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T303603)', diff saved to https://phabricator.wikimedia.org/P28175 and previous config saved to /var/cache/conftool/dbconfig/20220520-113207-ladsgroup.json [11:32:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:32:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:33] (PS1) Jbond: rake: sremove debug statments [puppet] - https://gerrit.wikimedia.org/r/793753 [11:32:46] (CR) Jbond: [V: +2 C: +2] rake: sremove debug statments [puppet] - https://gerrit.wikimedia.org/r/793753 (owner: Jbond) [11:33:21] slyngs: moritzm: merge [11:34:08] d [11:36:21] (PS3) JMeybohm: Remove null creationTimestamp from CRDs [deployment-charts] - https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165) [11:36:23] (PS2) JMeybohm: Add crds.yaml fixtures to charts and istio schema [deployment-charts] - https://gerrit.wikimedia.org/r/793509 (https://phabricator.wikimedia.org/T306165) [11:36:25] (PS8) JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [11:37:09] Puppet, Infrastructure-Foundations: puppet admin: check if additional gropus in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (jbond) [11:38:55] Puppet, 
Infrastructure-Foundations: puppet admin: check if additional groups in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (jbond) [11:40:43] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff) [11:41:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:41:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:41:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T303603)', diff saved to https://phabricator.wikimedia.org/P28176 and previous config saved to /var/cache/conftool/dbconfig/20220520-114202-ladsgroup.json [11:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:11] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:42:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298555)', diff saved to https://phabricator.wikimedia.org/P28177 and previous config saved to /var/cache/conftool/dbconfig/20220520-114234-ladsgroup.json 
[11:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:39] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [11:42:43] (PS2) Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - https://gerrit.wikimedia.org/r/793711 [11:43:28] (CR) Jbond: [C: +1] mediawiki::system_users: add mwpresync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: Giuseppe Lavagetto) [11:43:30] (CR) jerkins-bot: [V: -1] [WIP]Move debmonitor to the new django profile format [puppet] - https://gerrit.wikimedia.org/r/793711 (owner: Jcrespo) [11:47:07] (PS1) Jbond: P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) [11:47:42] (CR) jerkins-bot: [V: -1] P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) (owner: Jbond) [11:48:32] (PS2) Jbond: P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) [11:48:49] (PS2) Hnowlan: image-suggestion: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891) [11:49:10] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35437/console" [puppet] - https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) (owner: Jbond) [11:54:14] SRE, serviceops, Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia 
CI - https://phabricator.wikimedia.org/T243847 (Daimona) >>! In T243847#7689119, @Daimona wrote: > Sorry for... [11:54:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2064.codfw.wmnet with OS bullseye [11:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:46] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2064.codfw.wmnet with OS bullseye [11:57:24] (PS1) Ladsgroup: Turn on WRITE BOTH for templatelink migration in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/793763 (https://phabricator.wikimedia.org/T299421) [12:00:30] (CR) Jbond: [V: +1 C: +2] P:sretest: test that sysuser can add users to groups managed by admin.yaml [puppet] - https://gerrit.wikimedia.org/r/793757 (https://phabricator.wikimedia.org/T308826) (owner: Jbond) [12:03:06] (PS1) Roman Stolar: [Do Not Merge] Improvements back upstream to have stable version with "thumbor community core" dependency. [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/793765 (https://phabricator.wikimedia.org/T308561) [12:04:47] (CR) jerkins-bot: [V: -1] [Do Not Merge] Improvements back upstream to have stable version with "thumbor community core" dependency. 
[software/thumbor-plugins] - https://gerrit.wikimedia.org/r/793765 (https://phabricator.wikimedia.org/T308561) (owner: Roman Stolar) [12:05:18] (PS1) Jbond: Revert "P:sretest: test that sysuser can add users to groups managed by admin.yaml" [puppet] - https://gerrit.wikimedia.org/r/793602 [12:06:49] (PS19) Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [12:06:51] (PS3) Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - https://gerrit.wikimedia.org/r/793711 [12:07:08] (PS2) Roman Stolar: [Do Not Merge] Improvements back upstream to have stable version with "thumbor community core" dependency. [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/793765 (https://phabricator.wikimedia.org/T308561) [12:07:13] Puppet, Infrastructure-Foundations, Patch-For-Review: puppet admin: check if additional groups in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (jbond) confirmed that the addtional_gropups parameter is not compatible with groups managed by the admin module. t... 
[12:07:17] (03CR) 10Jbond: [C: 03+2] Revert "P:sretest: test that sysuser can add users to groups managed by admin.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/793602 (owner: 10Jbond) [12:07:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:sretest: test that sysuser can add users to groups managed by admin.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/793602 (owner: 10Jbond) [12:10:40] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:10:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2064.codfw.wmnet with reason: host reimage [12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:11:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:11:12] (03CR) 10Jbond: [C: 03+1] mediawiki::system_users: add mwpresync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [12:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28178 and previous config saved to /var/cache/conftool/dbconfig/20220520-121116-ladsgroup.json [12:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:21] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:12:05] (03PS4) 10Hashar: Json schema 
from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) [12:13:10] (03CR) 10jerkins-bot: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [12:13:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2064.codfw.wmnet with reason: host reimage [12:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:17] (03PS1) 10Stang: commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) [12:21:45] (03PS1) 10Jbond: CONTRIBUTORS: Add my own personal email address [puppet] - 10https://gerrit.wikimedia.org/r/793767 [12:23:27] !log killed refreshlinks suggestion in 10160 [12:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:31] in hiwiki [12:26:40] (03PS4) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [12:26:44] (03PS1) 10Slyngshede: WIP: Private APT repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/793769 [12:29:14] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Amire80) [12:30:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T303603)', diff saved to https://phabricator.wikimedia.org/P28179 and previous config saved to /var/cache/conftool/dbconfig/20220520-123037-ladsgroup.json [12:30:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:30:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:43] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:30:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T303603)', diff saved to https://phabricator.wikimedia.org/P28180 and previous config saved to /var/cache/conftool/dbconfig/20220520-123045-ladsgroup.json [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793769 (owner: 10Slyngshede) [12:31:46] (03CR) 10Slyngshede: [C: 03+2] WIP: Private APT repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/793769 (owner: 10Slyngshede) [12:37:32] !log copy prometheus-mcrouter-exporter from buster-wikimedia to bullseye-wikimedia (needed for T308214) [12:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:37] T308214: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 [12:42:45] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@51a203f]: (no justification provided) [12:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:53] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@51a203f]: (no justification provided) (duration: 00m 07s) [12:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:26] (03PS2) 10TheDJ: Remove unused OggThumbLocation config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791605 (https://phabricator.wikimedia.org/T308191) [12:44:34] (03PS5) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [12:45:00] (03PS1) 10Muehlenhoff: Enable new Bullseye test IDPs in acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/793770 (https://phabricator.wikimedia.org/T308214) [12:45:08] (03CR) 10ArielGlenn: "Please excuse my drive by comment. But... labstore1006 isn't always the box handling web service. Maybe it makes sense to use a service na" [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: 10BBlack) [12:47:09] (03PS2) 10Jbond: CONTRIBUTORS: Add myself and Amir E. 
Aharoni to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/793767 (https://phabricator.wikimedia.org/T308013) [12:49:15] (03PS1) 10Jcrespo: django: Add dummy django secret key and mysql pass to test compilation [labs/private] - 10https://gerrit.wikimedia.org/r/793771 (https://phabricator.wikimedia.org/T283017) [12:50:03] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] django: Add dummy django secret key and mysql pass to test compilation [labs/private] - 10https://gerrit.wikimedia.org/r/793771 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:51:25] (03CR) 10Ladsgroup: [C: 03+1] "I'll deploy it on Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791605 (https://phabricator.wikimedia.org/T308191) (owner: 10TheDJ) [12:52:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10zeljkofilipin) [12:54:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2064.codfw.wmnet with OS bullseye [12:54:39] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2064.codfw.wmnet with OS bullseye completed: - ms-be2064 (**PASS**) - Downtim... [12:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:54] (03PS3) 10Jbond: CONTRIBUTORS: Add myself,Željko Filipin and Amir E. Aharoni to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/793767 (https://phabricator.wikimedia.org/T308013) [12:55:32] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10santhosh) [12:55:51] (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: Add myself,Željko Filipin and Amir E. 
Aharoni to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/793767 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [12:56:33] (03PS20) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [12:57:11] (03PS6) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [12:59:21] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:59:23] (03PS1) 10Jbond: CONTRIBUTORS: Add Santhosh Thottingal [puppet] - 10https://gerrit.wikimedia.org/r/793772 (https://phabricator.wikimedia.org/T308013) [13:00:37] (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: Add Santhosh Thottingal [puppet] - 10https://gerrit.wikimedia.org/r/793772 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [13:01:38] (03PS1) 10Muehlenhoff: klaxon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013) [13:01:40] (03PS1) 10Muehlenhoff: helm/helmfile: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013) [13:01:42] (03PS1) 10Muehlenhoff: thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) [13:01:46] (03PS1) 10Muehlenhoff: thumbor: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793776 (https://phabricator.wikimedia.org/T308013) [13:01:48] (03PS1) 10Muehlenhoff: amd_rocm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793777 [13:01:50] (03PS1) 10Muehlenhoff: debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 [13:04:02] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:10] (03CR) 10jerkins-bot: [V: 04-1] debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff) [13:08:14] (03PS21) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:08:54] (03PS1) 10Jbond: rake spdx: add task to list of missing permission contributors [puppet] - 10https://gerrit.wikimedia.org/r/793780 (https://phabricator.wikimedia.org/T308013) [13:09:32] (03CR) 10Jbond: [V: 03+2 C: 03+2] rake spdx: add task to list of missing permission contributors [puppet] - 10https://gerrit.wikimedia.org/r/793780 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [13:10:02] (03PS7) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:10:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:12:00] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:25] (03PS1) 10Jbond: rake spdx: make list unique [puppet] - 10https://gerrit.wikimedia.org/r/793781 [13:12:41] (03CR) 10Jbond: [V: 03+2 C: 03+2] rake spdx: make list unique [puppet] - 10https://gerrit.wikimedia.org/r/793781 (owner: 10Jbond) [13:15:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2065.codfw.wmnet with OS bullseye [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:39] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2065.codfw.wmnet with OS bullseye [13:17:43] (03PS22) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:17:53] (03CR) 10Jcrespo: [C: 03+2] [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 (owner: 10Jcrespo) [13:18:15] (03PS8) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:18:42] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/793711 (owner: 10Jcrespo) [13:20:16] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T303603)', diff saved to https://phabricator.wikimedia.org/P28181 and previous config saved to /var/cache/conftool/dbconfig/20220520-132307-ladsgroup.json [13:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:23:12] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:23:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:23:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [13:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:18] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [13:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:56] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet,service=ats-be [13:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:01] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet,service=varnish-fe [13:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet,service=ats-tls [13:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459 [13:24:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459 [13:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:58] T308459: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 [13:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:08] (03CR) 10Ssingh: [C: 03+1] "Thanks very much for the patch, LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/793724 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:31:43] (03PS9) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:33:48] (03PS10) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:35:51] 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Jclark-ctr) @Marostegui sorry i thought Chris had worked on it last week it was physically unplugged and crash cart was in rack. i have plugged power back into it and it is up [13:36:02] 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Jclark-ctr) 05Open→03Resolved [13:36:42] 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Marostegui) Thanks John: ` ------------------------------------------------------------------------------- Record: 26 Date/Time: 05/20/2022 13:33:30 Source: system Severity: Ok Descrip... 
[13:36:54] (03PS23) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:36:57] (03CR) 10Jbond: [C: 03+1] klaxon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:37:04] (03CR) 10Jbond: [C: 03+1] helm/helmfile: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:37:07] (03PS11) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:37:09] (03CR) 10Jbond: [C: 03+1] thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:37:12] (03PS1) 10Marostegui: Revert "db1164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/793787 [13:37:14] (03CR) 10Jbond: [C: 03+1] thumbor: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793776 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:37:19] (03CR) 10Jbond: [C: 03+1] amd_rocm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793777 (owner: 10Muehlenhoff) [13:37:31] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:37:47] (03CR) 10Jforrester: [C: 03+1] "Lovely to see this happening!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: 10Stang) [13:38:14] (03PS2) 10Jbond: debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff) [13:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 1%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28182 and previous config saved to /var/cache/conftool/dbconfig/20220520-133815-root.json [13:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:18] (03CR) 10Jbond: [C: 03+1] debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff) [13:38:22] (03CR) 10Marostegui: [C: 03+2] Revert "db1164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/793787 (owner: 10Marostegui) [13:39:21] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793770 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [13:39:58] (03CR) 10jerkins-bot: [V: 04-1] debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff) [13:41:12] 10SRE, 10Infrastructure-Foundations, 10netops: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (10Papaul) @Volans thanks I have another server to install next week i will try and let you know. 
[13:41:53] (03PS24) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:42:06] (03PS12) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:42:31] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:43:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:43:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:24] (03PS1) 10Muehlenhoff: idp: Remove component/idp-test [puppet] - 10https://gerrit.wikimedia.org/r/793783 (https://phabricator.wikimedia.org/T308214) [13:44:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2065.codfw.wmnet with reason: host reimage [13:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [13:45:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [13:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T298565)', diff saved to https://phabricator.wikimedia.org/P28183 and 
previous config saved to /var/cache/conftool/dbconfig/20220520-134515-ladsgroup.json [13:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:23] (03PS25) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:45:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:45:38] (03PS13) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:46:03] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:47:56] (03CR) 10Vgutierrez: [C: 03+1] Enable new Bullseye test IDPs in acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/793770 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [13:48:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2065.codfw.wmnet with reason: host reimage [13:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:49:02] (03PS26) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:49:39] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 
(https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:50:16] (03PS27) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:51:58] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:45] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:53:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 5%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28184 and previous config saved to /var/cache/conftool/dbconfig/20220520-135319-root.json [13:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:53:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T303603)', diff saved to https://phabricator.wikimedia.org/P28185 and previous config saved to /var/cache/conftool/dbconfig/20220520-135350-ladsgroup.json [13:53:55] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:55] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:54:34] 10SRE, 10ops-codfw, 10Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (10ssingh) Hi @Papaul: Thanks for letting us know! The host is depooled and downtimed and so please proceed whenever you want. Thanks! [13:56:24] (03PS28) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:56:54] (03PS14) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [13:58:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:58:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:06] (03PS1) 10Jbond: rake: spdx fix rubocop issues [puppet] - 10https://gerrit.wikimedia.org/r/793807 [14:02:18] (03CR) 10Jbond: [C: 03+2] rake: spdx fix rubocop issues [puppet] - 10https://gerrit.wikimedia.org/r/793807 (owner: 10Jbond) [14:03:00] (03PS3) 10Jbond: debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff) [14:03:27] (03CR) 10Jbond: "not sure why this was failing CI and others where not, however i have sent a fix and rebased this so it should be green now" [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff) [14:03:35] (03PS29) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 
(https://phabricator.wikimedia.org/T283017) [14:04:36] (03PS30) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:05:12] (03PS15) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:05:20] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:06:00] (03PS16) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:06:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793783 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [14:08:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28186 and previous config saved to /var/cache/conftool/dbconfig/20220520-140823-root.json [14:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2065.codfw.wmnet with OS bullseye [14:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:14] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2065.codfw.wmnet with OS bullseye completed: - ms-be2065 (**PASS**) - Downtim... 
[14:09:16] (03PS31) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:09:33] (03PS17) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:11:32] (03PS32) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:11:46] (03PS18) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:12:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS bullseye [14:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:24] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2066.codfw.wmnet with OS bullseye [14:13:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T303603)', diff saved to https://phabricator.wikimedia.org/P28187 and previous config saved to /var/cache/conftool/dbconfig/20220520-141308-ladsgroup.json [14:13:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:13:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:14] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:13:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T303603)', diff saved to 
https://phabricator.wikimedia.org/P28188 and previous config saved to /var/cache/conftool/dbconfig/20220520-141316-ladsgroup.json [14:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:34] (03PS33) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:13:51] (03PS19) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:15:03] (03PS1) 10Hnowlan: CONTRIBUTORS: add hnowlan entry [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013) [14:15:31] (03PS34) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:15:40] (03PS20) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:18:42] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:20:17] (03CR) 10Elukey: [C: 03+1] "super ignorant about SPDX - are we talking about the license for the Puppet module, or the ROCm suite?" 
[puppet] - 10https://gerrit.wikimedia.org/r/793777 (owner: 10Muehlenhoff) [14:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T303603)', diff saved to https://phabricator.wikimedia.org/P28189 and previous config saved to /var/cache/conftool/dbconfig/20220520-142032-ladsgroup.json [14:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:38] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:20:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013) (owner: 10Hnowlan) [14:22:33] (03PS9) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [14:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28190 and previous config saved to /var/cache/conftool/dbconfig/20220520-142327-root.json [14:23:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35457/console" [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [14:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:34] (03PS1) 10Filippo Giunchedi: lvs: stop double-checking catalog services from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/793815 (https://phabricator.wikimedia.org/T291946) [14:24:36] (03PS1) 10Filippo Giunchedi: icinga: deprecate service::monitor class [puppet] - 10https://gerrit.wikimedia.org/r/793816 (https://phabricator.wikimedia.org/T291946) [14:24:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Pretty great! 
See my small comments but LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [14:24:40] (03PS1) 10Filippo Giunchedi: icinga: remove 'monitoring' from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) [14:25:14] (03CR) 10Jcrespo: "Hello, @moritz" [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [14:26:43] (03PS1) 10Volans: sre.hosts.reimage: DHCP workaround for row E/F [cookbooks] - 10https://gerrit.wikimedia.org/r/793818 (https://phabricator.wikimedia.org/T306421) [14:28:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage [14:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:36] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013) (owner: 10Hnowlan) [14:29:38] (03PS35) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:30:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [14:30:19] (03PS21) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:31:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:profile::redis::multidc: drop legacy function redis_get_instances [puppet] - 10https://gerrit.wikimedia.org/r/793111 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [14:31:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: 
host reimage [14:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:35] (03PS36) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:32:57] (03PS22) 10Jcrespo: [WIP]Move debmonitor to the new django profile format [puppet] - 10https://gerrit.wikimedia.org/r/793711 [14:38:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28191 and previous config saved to /var/cache/conftool/dbconfig/20220520-143830-root.json [14:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298565)', diff saved to https://phabricator.wikimedia.org/P28192 and previous config saved to /var/cache/conftool/dbconfig/20220520-144111-ladsgroup.json [14:41:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [14:41:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [14:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28193 and 
previous config saved to /var/cache/conftool/dbconfig/20220520-144212-ladsgroup.json [14:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:42:18] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [14:42:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:42:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 10 hosts with reason: Maintenance [14:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 10 hosts with reason: Maintenance [14:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/793818 (https://phabricator.wikimedia.org/T306421) (owner: 10Volans) [14:45:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sre: port mediawiki php-fpm saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:46:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2066.codfw.wmnet with OS bullseye [14:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:07] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by mvernon@cumin2002 for host ms-be2066.codfw.wmnet with OS bullseye completed: - ms-be2066 (**PASS**) - Downtim... [14:49:37] (03CR) 10Cathal Mooney: [C: 03+1] "Looks great! Thanks for taking the time volans :). This may occur on cloudsw1-e4/f4 given the criteria, but I don't believe there is any" [cookbooks] - 10https://gerrit.wikimedia.org/r/793818 (https://phabricator.wikimedia.org/T306421) (owner: 10Volans) [14:50:56] (03PS2) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) [14:52:25] (03CR) 10jerkins-bot: [V: 04-1] [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [14:53:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28194 and previous config saved to /var/cache/conftool/dbconfig/20220520-145334-root.json [14:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:49] (03CR) 10JMeybohm: Replace kubeyaml with kubeconform (if available) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [14:54:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye [14:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:50] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye [14:57:47] (03PS9) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [14:57:59] (03CR) 
10JMeybohm: Replace kubeyaml with kubeconform (if available) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [14:59:48] 10SRE, 10ops-codfw, 10Traffic: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 (10Papaul) @ssingh thanks will work on it when back on site next week [15:02:17] (03PS1) 10Jbond: WIP: port kafka_config to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793821 [15:02:58] (03CR) 10jerkins-bot: [V: 04-1] WIP: port kafka_config to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793821 (owner: 10Jbond) [15:04:23] (03PS1) 10Jbond: CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013) [15:04:43] (03PS1) 10Ryan Kemper: Contributors: Add ryan kemper [puppet] - 10https://gerrit.wikimedia.org/r/793823 [15:05:41] (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: add hnowlan entry [puppet] - 10https://gerrit.wikimedia.org/r/793811 (https://phabricator.wikimedia.org/T308013) (owner: 10Hnowlan) [15:06:02] (03PS2) 10Jbond: CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013) [15:06:26] (03PS2) 10Jbond: Contributors: Add ryan kemper [puppet] - 10https://gerrit.wikimedia.org/r/793823 (owner: 10Ryan Kemper) [15:06:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:06:33] (03CR) 10Jbond: [V: 03+2 C: 03+2] Contributors: Add ryan kemper [puppet] - 10https://gerrit.wikimedia.org/r/793823 (owner: 10Ryan Kemper) [15:06:56] (03PS3) 10Jbond: CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013) [15:08:39] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P28195 and previous config saved to /var/cache/conftool/dbconfig/20220520-150838-root.json [15:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:54] (03CR) 10Ahmon Dancy: [C: 03+1] CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [15:10:57] (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: add Ahmon Dancy [puppet] - 10https://gerrit.wikimedia.org/r/793822 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [15:11:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [15:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:48] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:14:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1118 T', diff saved to https://phabricator.wikimedia.org/P28196 and previous config saved to /var/cache/conftool/dbconfig/20220520-151407-ladsgroup.json [15:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [15:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2068.codfw.wmnet with OS bullseye [15:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:04] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye [15:19:17] (03PS5) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) [15:19:45] (03CR) 10jerkins-bot: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [15:20:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:20:46] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:27:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793520 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:28:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye [15:28:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [15:28:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [15:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:26] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye completed: - ms-be2067 (**PASS**) - Downtim... 
[15:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2069.codfw.wmnet with OS bullseye [15:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:53] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye [15:33:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [15:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2068.codfw.wmnet with reason: host reimage [15:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:51] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:43:25] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:46:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [15:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2069.codfw.wmnet with reason: host reimage [15:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:44] (03CR) 10Hnowlan: [C: 03+2] image-suggestion: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:54:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2068.codfw.wmnet with OS bullseye [15:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:31] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2068.codfw.wmnet with OS bullseye completed: - ms-be2068 (**PASS**) - Downtim... 
[15:58:33] (03Merged) 10jenkins-bot: image-suggestion: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/793747 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:58:42] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: sync [15:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:43] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:21] (03CR) 10Tchanders: [C: 03+1] Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [16:00:49] (03CR) 10Tchanders: [C: 03+1] Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [16:02:18] (03PS2) 10Jbond: P:kafka: drop legacy kefka_config and kafka_config_name functions [puppet] - 10https://gerrit.wikimedia.org/r/793821 (https://phabricator.wikimedia.org/T308639) [16:03:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2069.codfw.wmnet with OS bullseye [16:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:56] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2069.codfw.wmnet with OS bullseye completed: - ms-be2069 (**PASS**) - Downtim... 
[16:04:36] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [16:05:10] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) All production and pre-production codfw backends done. [16:05:53] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:08:45] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync [16:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:25] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [16:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:17:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:14] (03PS1) 10Hnowlan: aqs: allow Kubernetes nodes access to cassandra [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) [16:19:29] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [16:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:38] (03PS1) 10Tchanders: Deploy IPInfo to all wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) [16:26:44] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:26:47] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:03] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35463/console" [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [16:31:14] (03PS1) 10Tchanders: Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876) [16:31:16] (03PS1) 10Tchanders: Add comment to consult Legal before updating IPInfo access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) [16:32:43] (03PS2) 10Hnowlan: aqs: allow Kubernetes nodes access to cassandra [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) [16:33:55] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35464/console" [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [16:35:00] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:14] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti5003.eqsin.wmnet with OS bullseye [16:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:19] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti5003.eqsin.wmnet with OS bullseye [16:38:07] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10RobH) I just overwrote the password with 
the exact same password, as described in the comment you linked. It didn't fix it, so I checked the settings, and it seems perhaps the firmware load dis... [16:41:16] (03CR) 10JHathaway: sre: port mx queue high page (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [16:55:50] (03CR) 10JHathaway: dumps: remove generic python 2.25.1 user agent block (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793550 (owner: 10JHathaway) [16:57:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:57:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:58] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:17] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:22] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5003.eqsin.wmnet with reason: host reimage [17:04:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [17:05:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [17:05:30] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:21] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:07:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5003.eqsin.wmnet with reason: host reimage [17:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:29] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:13] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MarkTraceur) [17:27:55] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) Thanks! [17:28:07] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5003.eqsin.wmnet with OS bullseye [17:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:11] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti5003.eqsin.wmnet with OS bullseye completed: - ganeti5003 (**WARN**) - Downtimed on Icinga/Ale... [17:32:10] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10RobH) a:05RobH→03MoritzMuehlenhoff @MoritzMuehlenhoff, The warn is due to the host being in bios when i fired off the script, so it couldn't disable puppet on the old OS. This host is now... 
[17:32:26] (CR) Dzahn: sre: port mx queue high page (1 comment) [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[17:35:24] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:46:40] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:51:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[17:53:16] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:53:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[17:53:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[17:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:55] !log [mwmaint1002:~] $ sudo mwscript initSiteStats.php --wiki=kcgwiki --update (to update statistics for latest wikipedia kcg) T305281
[17:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:59] T305281: Post-creation work for kcgwiki - https://phabricator.wikimedia.org/T305281
[17:56:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[17:58:07] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (cmooney) Just a brief update here. I've completed the migration of the existing cloud realm networks configured on c...
[18:00:57] (PS1) Cathal Mooney: Change cloudsw loopback filter to common one [homer/public] - https://gerrit.wikimedia.org/r/793855 (https://phabricator.wikimedia.org/T304989)
[18:01:14] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:01:55] (CR) Cathal Mooney: [C: +2] Change cloudsw loopback filter to common one [homer/public] - https://gerrit.wikimedia.org/r/793855 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[18:02:30] (Merged) jenkins-bot: Change cloudsw loopback filter to common one [homer/public] - https://gerrit.wikimedia.org/r/793855 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[18:04:29] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:50] PROBLEM - SSH on ms-be1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:09:24] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:09:58] RECOVERY - SSH on ms-be1066 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:11:40] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:13:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:14:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:15:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.337 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:16:20] (PS1) Volans: ganeti: add set_boot_media() method [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661)
[18:16:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.326 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:17:12] (CR) Volans: "Adding support for https://wikitech.wikimedia.org/wiki/Ganeti#Set_boot_order_to_disk" [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661) (owner: Volans)
[18:19:45] (PS3) BBlack: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799)
[18:23:23] (CR) jerkins-bot: [V: -1] ganeti: add set_boot_media() method [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661) (owner: Volans)
[18:34:42] (CR) Dwisehaupt: [C: +1] "Ha, yes. Fix the typo." [dns] - https://gerrit.wikimedia.org/r/793725 (https://phabricator.wikimedia.org/T308672) (owner: Volans)
[18:36:30] (CR) Dwisehaupt: [C: +1] "This look correct and appropriate to me" [dns] - https://gerrit.wikimedia.org/r/793726 (https://phabricator.wikimedia.org/T308672) (owner: Volans)
[18:41:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[18:41:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[18:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:52] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:52:10] (PS1) Zabe: ulogd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793858 (https://phabricator.wikimedia.org/T308013)
[18:52:12] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-243916.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:54:48] (PS1) Zabe: udev: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793859 (https://phabricator.wikimedia.org/T308013)
[18:57:14] (PS1) Zabe: trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793860 (https://phabricator.wikimedia.org/T308013)
[19:00:26] PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The following units failed: session-338037.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:08] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:04:17] (CR) Ori: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: Ori)
[19:04:34] (PS3) Ori: New service: function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698)
[19:06:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[19:06:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[19:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28198 and previous config saved to /var/cache/conftool/dbconfig/20220520-190633-ladsgroup.json
[19:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:40] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[19:29:18] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:45:19] (CR) Dzahn: [C: +2] "being bold, just merging it and testing it on gitlab1004 .. since this is also a bit time constrained with the delays to get stuff into pr" [puppet] - https://gerrit.wikimedia.org/r/793534 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[19:49:20] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:01:26] (PS1) Ladsgroup: Make IS.php return an array [mediawiki-config] - https://gerrit.wikimedia.org/r/793869
[20:02:45] (PS1) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[20:05:14] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:13:16] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:13:34] (PS2) Krinkle: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821)
[20:15:01] (CR) jerkins-bot: [V: -1] wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[20:15:12] (PS1) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:16:14] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:19:47] (PS1) Dduvall: WIP docker_registry_ha: Support GitLab JSON Web Token auth [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501)
[20:20:15] (PS3) Krinkle: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821)
[20:20:17] (PS2) Krinkle: MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821)
[20:20:40] (PS2) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[20:20:42] (PS2) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:22:12] (CR) jerkins-bot: [V: -1] wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[20:22:17] (CR) jerkins-bot: [V: -1] MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[20:22:29] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:29:19] (CR) Krinkle: Move out ORES extension configuration out of InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:30:16] (PS3) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:30:21] (CR) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:31:35] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[20:34:06] (PS4) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[20:36:04] (PS4) Krinkle: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821)
[20:36:06] (PS3) Krinkle: MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821)
[20:42:52] SRE, MediaWiki-extensions-CodeReview, Platform Engineering, serviceops-radar, Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (Legoktm) >>! In T205361#7815573, @Krinkle wrote: >>>! In T205361#7815521, @...
[20:50:14] (PS2) Volans: ganeti: add set_boot_media() method [software/spicerack] - https://gerrit.wikimedia.org/r/793856 (https://phabricator.wikimedia.org/T306661)
[21:00:23] SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (bking) Hey Volans, sorry I didn't get to this by end of week as promised; I was sick on Weds and Thurs. Starting Monday, some...
[21:10:14] (PS3) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:10:16] (PS5) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:12:25] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:12:45] (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:13:41] (PS4) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:13:43] (PS6) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:15:22] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:15:39] (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:19:37] (PS5) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:19:39] (PS7) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:21:10] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:21:19] (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:22:49] (PS8) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:23:58] (PS6) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:24:00] (PS9) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:24:32] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:25:14] (CR) jerkins-bot: [V: -1] Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:25:16] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:26:36] (PS7) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[21:26:38] (PS10) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:27:35] (CR) jerkins-bot: [V: -1] Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[21:31:14] (PS11) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[21:33:56] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab1004.wikimedia.org with reason: reimage
[21:33:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab1004.wikimedia.org with reason: reimage
[21:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:38] !log reimaging gitlab1004 (insetup) to test partman recipe from gerrit:793534 - T307142
[21:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:42] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142
[21:36:05] !log attempt to use reimage cookbook failed: spicerack.netbox.NetboxHostNotFoundError
[21:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:10] !log attempt to use reimage cookbook failed: spicerack.netbox.NetboxHostNotFoundError T307142
[21:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:20] !log correction: mistake was to use FQDN T307142
[21:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:26] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bullseye
[21:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:20] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage
[21:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:54:50] (CR) Krinkle: Make CommonSettings load the array from IS.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:55:13] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage
[21:55:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28201 and previous config saved to /var/cache/conftool/dbconfig/20220520-215514-ladsgroup.json
[21:55:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:55:16] (CR) Krinkle: Make CommonSettings load the array from IS.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[21:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:23] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[21:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28202 and previous config saved to /var/cache/conftool/dbconfig/20220520-220046-ladsgroup.json
[22:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:46] (PS8) Ladsgroup: Make CommonSettings load the array from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871
[22:02:49] (PS12) Ladsgroup: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793873
[22:03:37] (CR) Ladsgroup: Make CommonSettings load the array from IS.php (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (owner: Ladsgroup)
[22:06:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1004.wikimedia.org with OS bullseye
[22:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:36] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 877698 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[22:08:00] SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (Volans) Sounds good to me. Thanks for the update :)
[22:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28203 and previous config saved to /var/cache/conftool/dbconfig/20220520-221550-ladsgroup.json
[22:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[22:24:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[22:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:31] SRE, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dmantena) Thanks for this information! Frankly, I'd prefer //not// to have production shell access and these elevated permissions. I'm just after a snapshot of the iOS notifications event dashboar...
[22:30:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28204 and previous config saved to /var/cache/conftool/dbconfig/20220520-223054-ladsgroup.json
[22:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:29] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:33:24] SRE, Analytics, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn) I think we should escalate this directly to the analytics team for advice how to move forward. Let me add them.
[22:33:28] SRE, Analytics, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dzahn)
[22:43:57] PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: session-337818.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: Maint finished', diff saved to https://phabricator.wikimedia.org/P28205 and previous config saved to /var/cache/conftool/dbconfig/20220520-224558-ladsgroup.json
[22:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:07] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:16:21] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:22:00] (PS1) Stang: rowiki: Use Romanian canonical name [mediawiki-config] - https://gerrit.wikimedia.org/r/793999 (https://phabricator.wikimedia.org/T127607)
[23:32:30] (PS1) Stang: Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - https://gerrit.wikimedia.org/r/794000 (https://phabricator.wikimedia.org/T308702)
[23:42:32] (PS19) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:42:59] (PS20) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:43:48] (CR) jerkins-bot: [V: -1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang)
[23:45:18] (PS21) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:46:59] (CR) jerkins-bot: [V: -1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang)
[23:53:52] (PS22) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)
[23:54:20] (PS23) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789)