[00:04:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T60674)', diff saved to https://phabricator.wikimedia.org/P29165 and previous config saved to /var/cache/conftool/dbconfig/20220531-000452-ladsgroup.json [00:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:59] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [00:07:35] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29166 and previous config saved to /var/cache/conftool/dbconfig/20220531-000937-ladsgroup.json [00:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:53] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:24:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29167 and previous config saved to /var/cache/conftool/dbconfig/20220531-002442-ladsgroup.json [00:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T60674)', diff saved to https://phabricator.wikimedia.org/P29168 and previous config saved to /var/cache/conftool/dbconfig/20220531-003947-ladsgroup.json [00:39:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [00:39:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [00:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:55] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [00:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:57] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T0100) [01:11:33] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:13:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T60674)', diff saved to https://phabricator.wikimedia.org/P29169 and previous config saved to /var/cache/conftool/dbconfig/20220531-011335-ladsgroup.json [01:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:44] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [01:14:21] Hi ops, trying to track down a possible outage that is generating on-wiki reports.  Seems like a lot of things can't hit the replica database enwiki_p [01:15:04] T309570 is an example [01:15:04] T309570: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 [01:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:02] @Amir1 some reports seem to be pointing that you may know more on this - but I could be reading them wrong [01:24:03] will try to follow up in #wikimedia-tech [01:26:56] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/801445 [01:28:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P29170 and previous config saved to /var/cache/conftool/dbconfig/20220531-012840-ladsgroup.json [01:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:59] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:40:10] (03PS3) 10Jforrester: deployment-prep: Drop 
deployment-restbase03, no longer to be used [puppet] - 10https://gerrit.wikimedia.org/r/790424 (https://phabricator.wikimedia.org/T306052) [01:43:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P29171 and previous config saved to /var/cache/conftool/dbconfig/20220531-014345-ladsgroup.json [01:43:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:53:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:56:19] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:58:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T60674)', diff saved to https://phabricator.wikimedia.org/P29172 and previous config saved to /var/cache/conftool/dbconfig/20220531-015850-ladsgroup.json [01:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:56] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [02:01:47] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:03:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:04:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.14 [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801446 [02:07:41] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.14 [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801446 (owner: 10TrainBranchBot) [02:22:43] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.14 [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801446 (owner: 10TrainBranchBot) [02:27:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [02:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:59] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:45:59] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:47:37] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:31:45] (03PS1) 10Andrew Bogott: Magnum: update [trust] settings [puppet] - 10https://gerrit.wikimedia.org/r/801453 [03:33:12] (03CR) 10Andrew Bogott: [C: 03+2] Magnum: update [trust] settings [puppet] - 10https://gerrit.wikimedia.org/r/801453 (owner: 10Andrew Bogott) [03:57:03] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [04:00:55] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project Andrew Bogott Im about to go to sleep but perhaps this ack will prevent a midnight page https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [04:07:25] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:24:43] RECOVERY - k8s API server requests latencies on 
ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:31:11] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (netbox2002), Fresh: 111 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:49:47] (03PS1) 10KartikMistry: Fix Tyap (kcg) namespace names [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801195 [04:58:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1130 with weight 0 T308725', diff saved to https://phabricator.wikimedia.org/P29173 and previous config saved to /var/cache/conftool/dbconfig/20220531-045824-root.json [04:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:33] T308725: Switchover s5 master db1100 -> db1130 - https://phabricator.wikimedia.org/T308725 [05:03:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 T308725 [05:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:38] T308725: Switchover s5 master db1100 -> db1130 - https://phabricator.wikimedia.org/T308725 [05:03:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T308725 [05:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:45] (03PS2) 10Marostegui: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/801361 (https://phabricator.wikimedia.org/T308725) [05:16:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/801361 (https://phabricator.wikimedia.org/T308725) (owner: 10Marostegui) [05:21:44] 10SRE, 10ops-eqiad: Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (10ops-monitoring-bot) [05:32:21] RECOVERY - Backup 
freshness on backup1001 is OK: Fresh: 112 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:39:17] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:40:09] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:42:19] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:44:33] PROBLEM - Host mw1334 is DOWN: PING CRITICAL - Packet loss = 100% [05:45:15] RECOVERY - Host mw1334 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [05:49:19] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:00:05] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T0600). [06:00:19] Amir1: around? [06:00:50] o/ [06:00:53] yup [06:00:56] starting! 
[06:00:57] !log Starting s5 eqiad failover from db1100 to db1130 - T308725 [06:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:06] T308725: Switchover s5 master db1100 -> db1130 - https://phabricator.wikimedia.org/T308725 [06:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T308725', diff saved to https://phabricator.wikimedia.org/P29174 and previous config saved to /var/cache/conftool/dbconfig/20220531-060112-root.json [06:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1130 to s5 primary and set section read-write T308725', diff saved to https://phabricator.wikimedia.org/P29175 and previous config saved to /var/cache/conftool/dbconfig/20220531-060140-root.json [06:01:44] > Warning: The database has been locked for maintenance, [06:01:44] all done [06:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:02] can write now [06:02:07] \o/ [06:02:49] (CR) Marostegui: [C: +2] wmnet: Update s5-master CNAME [dns] - https://gerrit.wikimedia.org/r/801362 (https://phabricator.wikimedia.org/T308725) (owner: Marostegui) [06:03:06] marostegui: do you want me for anything? [06:03:23] Amir1: nope you are good [06:03:43] cool. Thanks [06:03:52] Amir1: thanks for the help! [06:05:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:05:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1100 T308725', diff saved to https://phabricator.wikimedia.org/P29176 and previous config saved to /var/cache/conftool/dbconfig/20220531-060518-root.json [06:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:01] Later in the day I need the old master for schema changes. 
Please let me know once you're done with it marostegui [06:06:24] Amir1: will do, going to apply some schema changes there now [06:06:33] Awesome [06:07:14] We had 29 seconds of RO time [06:09:40] (03PS1) 10Marostegui: db1100: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/801615 [06:10:47] !log dbmaint s5@eqiad T298557 [06:10:53] (03CR) 10Marostegui: [C: 03+2] db1100: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/801615 (owner: 10Marostegui) [06:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:54] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:26:47] !log `elukey@an-master1001:~$ sudo systemctl reset-failed hadoop-clean-fairscheduler-event-logs.service` [06:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:19] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:42:17] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:25] (03PS1) 10Tim Starling: [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [06:47:03] (03CR) 10CI reject: [V: 04-1] [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [06:53:15] jouncebot: next [06:53:15] In 0 hour(s) and 6 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T0700) [06:53:49] (03PS3) 10Muehlenhoff: helm/helmfile: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013) [06:54:43] 
I'll +2 a few minutes ahead as CI will take some time to merge the core patch to wmf.13. [06:54:48] i.e. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/801195 [07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T0700). [07:00:05] kart_, aharoni, and sharvani_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:32] (PS1) Marostegui: es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/801623 (https://phabricator.wikimedia.org/T309265) [07:00:40] (PS2) Sharvaniharan: Stream config for android breadcrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/801018 [07:00:45] OK. I'll go ahead with my config patch first.. [07:01:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 for migration to 10.6', diff saved to https://phabricator.wikimedia.org/P29178 and previous config saved to /var/cache/conftool/dbconfig/20220531-070058-root.json [07:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:49] (CR) KartikMistry: [C: +2] testwiki: Enable Section Translation in 10 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/800833 (https://phabricator.wikimedia.org/T308829) (owner: KartikMistry) [07:01:59] (CR) Marostegui: [C: +2] es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/801623 (https://phabricator.wikimedia.org/T309265) (owner: Marostegui) [07:02:40] (Merged) jenkins-bot: testwiki: Enable Section Translation in 10 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/800833 (https://phabricator.wikimedia.org/T308829) (owner: KartikMistry) [07:02:42] (PS1) Muehlenhoff: Point idp-test to idp-test1002 [dns] - 
https://gerrit.wikimedia.org/r/801624 (https://phabricator.wikimedia.org/T308214) [07:03:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:05:00] (CR) Muehlenhoff: [C: +2] helm/helmfile: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [07:05:55] Things look good. Deploying.. [07:06:21] Hallo world [07:07:56] aharoni: I'll +2 your patch now. [07:08:03] here :) [07:08:09] (CR) KartikMistry: [C: +2] Fix Tyap (kcg) namespace names [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/801195 (owner: KartikMistry) [07:08:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:09:12] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800833|testwiki: Enable Section Translation in 10 Wikipedias (T308829)]] (duration: 03m 02s) [07:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:19] T308829: Enable Section Translation on 10 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T308829 [07:09:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:10:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [07:10:27] sharvani_: you want to self-deploy? [07:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:59] I don't know how to... could you please deploy mine too? [07:11:37] OK. Would you be able to test the change on mwdebug1001? [07:11:50] yes i can ... thank you so much! [07:12:15] OK. Wait. Let me merge and pull it to mwdebug1001. [07:12:18] kart_, is the Tyap patch ready for testing? [07:12:39] (PS3) KartikMistry: Stream config for android breadcrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/801018 (owner: Sharvaniharan) [07:12:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:11] (PS1) Marostegui: Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/801199 [07:14:01] (CR) KartikMistry: [C: +2] Stream config for android breadcrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/801018 (owner: Sharvaniharan) [07:14:29] aharoni: no. CI will take 10 more minutes :) [07:14:37] Oh, OK. [07:14:47] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:14:54] (Merged) jenkins-bot: Stream config for android breadcrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/801018 (owner: Sharvaniharan) [07:14:56] (CR) Marostegui: [C: +2] Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/801199 (owner: Marostegui) [07:15:21] I thought it's faster for backported commits. 
[07:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29179 and previous config saved to /var/cache/conftool/dbconfig/20220531-071522-root.json [07:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:54] sharvani_: Your patch is available on mwdebug1001. Please test. [07:16:41] Tested. looks good. thank you so much for deploying. [07:16:43] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:17:00] sharvani_: now doing actual deployment :) [07:17:28] 😃👍 [07:17:32] !log push new pfw firewall rules - T309236 [07:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:29] sharvani_: I think scap is still taking some time.. [07:20:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:20:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:57] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:801018|Stream config for android breadcrumbs schema]] (duration: 03m 09s) [07:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:07] OK. Done now. 
[07:21:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:55] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10ayounsi) [07:22:22] Thank you Kartik. [07:24:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:26:01] (03Merged) 10jenkins-bot: Fix Tyap (kcg) namespace names [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801195 (owner: 10KartikMistry) [07:26:19] kart_ I'm back... browser+IRC crashed. Is CI done? [07:26:38] Amir1: Just done. Deploying and will let you know in a minute to test. [07:26:53] oh. Wrong amir. aharoni ^ [07:27:39] !log add profile k8s_mlstaging + authkey for ml-staging k8s - T302195 [07:27:40] 😁😁😁 [07:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:47] T302195: Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 [07:27:51] I forgot "PKI" sigh [07:27:52] amending [07:28:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:28:39] (03CR) 10Volans: [C: 03+2] sre.hosts.reboot-single: hide cumin progress [cookbooks] - 10https://gerrit.wikimedia.org/r/801414 (owner: 10Volans) [07:28:55] aharoni: Can you test on the mwdebug1001? 
[07:29:04] Doing [07:29:58] (03PS2) 10Tim Starling: [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [07:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29180 and previous config saved to /var/cache/conftool/dbconfig/20220531-073026-root.json [07:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:31:10] (03CR) 10CI reject: [V: 04-1] [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [07:31:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:53] kart_, I don't yet see it. I selected mwdebug1001.eqiad.wmnet, set the browser extension to "On", and loaded Special:AllPages. I still see the old namespaces names. [07:32:08] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: hide cumin progress [cookbooks] - 10https://gerrit.wikimedia.org/r/801414 (owner: 10Volans) [07:32:43] aharoni: can you test again? [07:33:43] Just tested again, and I still see the old names. https://kcg.wikipedia.org/wiki/A%E2%80%8C%CC%B1%C3%A1%C3%ADkhapsak:AllPages [07:34:17] In the "A̱ghwop-a̱lyoot:" dropdown, I expect to see "Sa" in the middle of the list, but I see "Sot", which is the old name. 
[07:34:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:34:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:39] yeah. Tested. It might be a localisation cache issue and will take time. [07:35:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:48] aharoni: I'll go ahead with deployment and see if it works fine with it. If not, we can check again. [07:36:54] OK :) [07:36:58] kart_: I think when I've done it before it's needed a rebuild [07:37:25] RhinosF1: so, what should I do on mwdebug1001? [07:37:34] fyi https://wikitech.wikimedia.org/wiki/How_to_deploy_code#More_complex_changes:_sync_everything [07:37:47] (PS1) Muehlenhoff: Grant access to Superset for theresnotime [puppet] - https://gerrit.wikimedia.org/r/801626 (https://phabricator.wikimedia.org/T309383) [07:38:06] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:38:18] kart_: you can't test this change. It'll need to just sync out [07:38:34] With sync-world as Nikerabbit says [07:38:54] RhinosF1: OK. Let me deploy then. [07:38:56] (CR) CI reject: [V: -1] Grant access to Superset for theresnotime [puppet] - https://gerrit.wikimedia.org/r/801626 (https://phabricator.wikimedia.org/T309383) (owner: Muehlenhoff) [07:39:15] Are namespace names in the localization cache? They are not usual messages. 
[07:39:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:40:20] aharoni: yes they are, all i18n stuff is [07:40:21] (03PS2) 10Muehlenhoff: Grant access to Superset for theresnotime [puppet] - 10https://gerrit.wikimedia.org/r/801626 (https://phabricator.wikimedia.org/T309383) [07:40:55] kart_: as an fyi, I'm disappearing in 5 minutes [07:42:16] !log kartik@deploy1002 Synchronized php-1.39.0-wmf.13/languages/messages/MessagesKcg.php: Backport: [[gerrit:801195|Fix Tyap (kcg) namespace names]] (duration: 03m 01s) [07:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:55] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10fgiunchedi) >>! In T309447#7966840, @Volans wrote: >>>! In T309447#7966236, @fgiunchedi wrote: >> Off the top of my head I ca... [07:43:01] (03PS1) 10Marostegui: es1022: Install 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/801627 (https://phabricator.wikimedia.org/T309265) [07:43:34] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:43:35] (03CR) 10Muehlenhoff: [C: 03+2] Grant access to Superset for theresnotime [puppet] - 10https://gerrit.wikimedia.org/r/801626 (https://phabricator.wikimedia.org/T309383) (owner: 10Muehlenhoff) [07:44:27] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10Marostegui) >>! In T309447#7969207, @fgiunchedi wrote: >>>! 
In T309447#7966840, @Volans wrote: >>>>! In T309447#7966236, @fgi... [07:44:46] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10Volans) >>! In T309447#7969207, @fgiunchedi wrote: > Since this is hopefully rare, personally I think we should focus on movi... [07:45:00] (03CR) 10DCausse: [C: 03+1] "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [07:45:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PII in Superset for TheresNoTime - https://phabricator.wikimedia.org/T309383 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I've just merged a patch to enable Sammy's access. It will take up to 30 minutes until the ch... [07:45:12] (03CR) 10Marostegui: [C: 03+2] es1022: Install 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/801627 (https://phabricator.wikimedia.org/T309265) (owner: 10Marostegui) [07:45:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29181 and previous config saved to /var/cache/conftool/dbconfig/20220531-074530-root.json [07:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:11] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (10Peachey88) [07:50:47] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (10dcaro) Snippet from dmesg on the failure event: ` [Tue May 31 05:07:39 2022] ata2.00: READ LOG DMA EXT failed, trying PIO [Tue May 31 05:07:39 2022] ata2: failed to read log page 1... 
[07:53:12] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for ozhang - https://phabricator.wikimedia.org/T309559 (10MoritzMuehlenhoff) Hi, we need a some additional data here, then we should be good to go: - @JMinor : Can you please clarify whether Superset access with or without access to private data is needed? S... [07:55:20] !log upgrade fastnetmon on netflow4002 to 1.2.1 T271228 [07:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:26] T271228: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 [07:59:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) (owner: 10Ori) [07:59:46] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for ozhang - https://phabricator.wikimedia.org/T309559 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:00:04] tgr: Your horoscope predicts another unfortunate Custom deployment window for session handling fix deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T0800). [08:00:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 20%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29182 and previous config saved to /var/cache/conftool/dbconfig/20220531-080034-root.json [08:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:36] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:02:40] tgr: hi [08:03:00] kostajh: O/ [08:03:27] tgr: are you going to squash the second patch into the first one, or backport that one separately? [08:04:11] kart_: still planning to sync-world or are you done? 
[08:04:23] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 (10MoritzMuehlenhoff) This was the debconf diff for the puppetised fastnetmon.conf as presented by dpkg. We should check whether some new options should be covered in our puppetised config f... [08:04:38] kostajh: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/799388 is already squashed [08:05:31] cool [08:06:24] tgr: I was reading how to do that. Give me a minute. [08:07:02] !log imported fastnetmon 1.2.1-1~deb11u1 to apt.wikimedia.org T271228 [08:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:09] T271228: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 [08:08:24] (03PS1) 10Giuseppe Lavagetto: rsyslog: do not use the same queue name for two logs [puppet] - 10https://gerrit.wikimedia.org/r/801628 [08:08:48] tgr: Is it OK to run it now? [08:09:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:09:43] kart_: sure [08:09:46] tgr: Looks good? `/srv/mediawiki-staging scap-world 'Backport: [[gerrit:801195|Fix Tyap (kcg) namespace names]]'` [08:09:59] (03CR) 10Jbond: [V: 03+1 C: 03+2] naggen2: inject # page alias for critical hosts [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [08:10:24] looks good, if the first part of that command is the current dir [08:10:31] Yep :) [08:10:54] I think the command is 'scap sync-world' [08:11:10] ah. Right. [08:11:36] !log kartik@deploy1002 Started scap: Backport: [[gerrit:801195|Fix Tyap (kcg) namespace names]] [08:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:51] I'll start merging in the meanwhile [08:12:01] tgr: OK. Thanks! 
[08:12:30] (03CR) 10Gergő Tisza: [C: 03+2] Tombstone the old session on SessionBackend::resetId() [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/799388 (https://phabricator.wikimedia.org/T299193) (owner: 10Gergő Tisza) [08:12:45] aharoni: Fixing cache. It will be reflected in a few minutes once done. [08:13:23] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (10MoritzMuehlenhoff) p:05Triage→03Medium The server is five years old and the new servers which replace cloudnet1003/1004 are currently being racked in https://phabricator.wikime... [08:14:24] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:15:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29183 and previous config saved to /var/cache/conftool/dbconfig/20220531-081538-root.json [08:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "See inline otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801628 (owner: 10Giuseppe Lavagetto) [08:18:22] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:23:50] (03PS3) 10Tim Starling: [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [08:24:14] (03PS1) 10Muehlenhoff: backup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801631 (https://phabricator.wikimedia.org/T308013) [08:24:15] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (10dcaro) > Since we won't fix the hardware of 1004 anymore, we could instead speed up the setup of the 1005/1006 and just live with the RAID state of the cloudnet backup to be degrad... [08:24:22] (03PS1) 10Muehlenhoff: arclamp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801632 (https://phabricator.wikimedia.org/T308013) [08:24:26] (03PS1) 10Muehlenhoff: auditd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801633 (https://phabricator.wikimedia.org/T308013) [08:24:30] (03PS1) 10Muehlenhoff: etherpad: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801634 (https://phabricator.wikimedia.org/T308013) [08:24:34] (03PS1) 10Muehlenhoff: chartmuseum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801635 (https://phabricator.wikimedia.org/T308013) [08:24:38] (03PS1) 10Muehlenhoff: puppetdb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801636 (https://phabricator.wikimedia.org/T308013) [08:24:42] (03PS1) 10Muehlenhoff: karapace: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801637 (https://phabricator.wikimedia.org/T308013) [08:24:46] (03PS1) 10Muehlenhoff: clamav: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801638 (https://phabricator.wikimedia.org/T308013) [08:24:50] (03PS1) 10Muehlenhoff: kartotherian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801639 
(https://phabricator.wikimedia.org/T308013) [08:24:54] (03PS1) 10Muehlenhoff: matomo: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801640 [08:24:58] (03PS1) 10Muehlenhoff: python_deploy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801641 (https://phabricator.wikimedia.org/T308013) [08:25:02] (03CR) 10CI reject: [V: 04-1] [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [08:25:10] (03PS1) 10Filippo Giunchedi: Run isort/black on the codebase [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) [08:25:14] (03PS1) 10Filippo Giunchedi: tox: add formattercheck [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801643 (https://phabricator.wikimedia.org/T309546) [08:25:18] (03PS1) 10Filippo Giunchedi: Use etcdmirror namespace for metrics [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) [08:25:19] tgr: scap world still running! 
[08:25:22] (03PS1) 10Filippo Giunchedi: Export lag as a Gauge metric [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801645 (https://phabricator.wikimedia.org/T309546) [08:25:22] kostajh: T299193#7969368 is the test plan [08:25:23] T299193: MediaWiki login failure due to race condition with session cookie - https://phabricator.wikimedia.org/T299193 [08:25:26] (03PS1) 10Filippo Giunchedi: Port to Python 3.5 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801646 (https://phabricator.wikimedia.org/T309546) [08:25:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10dcaro) [08:25:35] (I might have overdone it a little) [08:25:38] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (10dcaro) [08:26:00] kart_: that's normal, takes 20 mins or so [08:27:36] (03PS4) 10Tim Starling: [WIP] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [08:28:21] !log kartik@deploy1002 Finished scap: Backport: [[gerrit:801195|Fix Tyap (kcg) namespace names]] (duration: 16m 44s) [08:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:29] tgr: ok [08:28:32] tgr: ah OK. Just done it seems :) [08:28:49] kart_ The Tyap patch is working in production now. Thank you. 
[08:29:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:29:16] (03Merged) 10jenkins-bot: Tombstone the old session on SessionBackend::resetId() [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/799388 (https://phabricator.wikimedia.org/T299193) (owner: 10Gergő Tisza) [08:29:26] aharoni: I was about to ping you. Thanks! Learned to scap the world :) [08:30:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 40%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29184 and previous config saved to /var/cache/conftool/dbconfig/20220531-083042-root.json [08:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:02] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801631 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:33:36] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:33:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801632 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:34:13] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801636 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:34:18] kart_: you are finished with the backport window, right? 
[08:34:26] (03CR) 10CI reject: [V: 04-1] matomo: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801640 (owner: 10Muehlenhoff) [08:34:58] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 (10ayounsi) Great, there is nothing of immediate interest in the diff. IPv6 will probably be the next step here in a different task. [08:35:33] !log upgrade fastnetmon to 1.2.1 in drmrs - T271228 [08:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:39] T271228: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 [08:35:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:36:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:36] !log upgrade fastnetmon to 1.2.1 in codfw - T271228 [08:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:44] I'll take that as a yes :) [08:40:55] (patch is on mwdebug1001) [08:44:43] tgr: cool, I will start testing [08:45:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29185 and previous config saved to /var/cache/conftool/dbconfig/20220531-084546-root.json [08:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:47:48] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 (10ayounsi) left are eqiad/esams/eqsin. I'll take care of them later today or tomorrow. [08:50:57] so far so good... [08:51:18] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10fgiunchedi) >>! In T309447#7969225, @Volans wrote: >>>! In T309447#7969207, @fgiunchedi wrote: >> Since this is hopefully rar... [08:51:57] tgr: I do see one "persisting session for unknown reason" log message https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2022.05.31?id=XbRPGYEBv2PODa7FgdPq [08:53:48] those logs are not super useful unfortunately [08:54:16] right [08:54:48] I should have phrased the message better, it really just means it's not possible to figure out at the site of logging why the save was initiated [08:55:06] the arwiki logout issue is still happening though [08:55:27] not entirely unexpected, but unfortunate [08:56:27] (03CR) 10Volans: [C: 03+1] "I've actually realized there is a problem with this patch." [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [08:57:04] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/801641 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:57:14] (03CR) 10Filippo Giunchedi: "Note I haven't tried building the Debian package yet with the py3 deps" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801646 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:58:54] tgr: what should happen if I manually delete the session in ObjectCache? (I am still logged in) [08:59:11] tgr: hmm, I did not reproduce the logout issue. How did you do it? [08:59:31] just luck, I suppose? 
I just did a normal signup [09:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 60%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29186 and previous config saved to /var/cache/conftool/dbconfig/20220531-090050-root.json [09:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:59] on a non-centralauth wiki, if you have used "remember me" for login, you should stay logged in after deleting from the session store; if you haven't, you shouldn't. On a centralauth wiki, I'm not 100% sure. I think even a non-"remember me" login should recover via the central session store there. [09:02:40] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:46] (03PS1) 10Jbond: naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 [09:03:39] (03CR) 10CI reject: [V: 04-1] naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 (owner: 10Jbond) [09:04:41] (03PS2) 10Jbond: naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) [09:05:43] (03CR) 10CI reject: [V: 04-1] naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [09:06:16] tgr: I've updated https://phabricator.wikimedia.org/T299193#7969512 with the testing I've done so far. It seems safe enough to sync, IMO. 
[09:06:24] (03PS3) 10Jbond: naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) [09:06:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:07:18] (03CR) 10CI reject: [V: 04-1] naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [09:13:04] (03CR) 10Muehlenhoff: [C: 03+2] idp-test: Point to the new Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/801402 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [09:14:18] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [09:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:15:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29187 and previous config saved to /var/cache/conftool/dbconfig/20220531-091553-root.json [09:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:40] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:20:08] (03PS6) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 [09:25:31] (03PS7) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 [09:25:44] <_joe_> jouncebot: next [09:25:44] In 3 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1300) [09:26:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [09:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:03] kostajh: I get the logout bug fairly consistently on arwiki (with or without the patch). It's not actually related to the welcome survey - even if I just reload the main page, once out of 3-4 attempts I will be logged out. If not directly after signup, it's not so obvious because centralauth autologin recovers the session. [09:29:50] (But the welcome survey probably breaks centralauth autologin, so I need to log in again to get it going.) [09:30:53] tgr: right, I don't think welcomesurvey is implicated here. 
It is triggered by the HTTP requests made on Special:CreateAccount for validating the password and username (or sometimes when assets from other wikis are loaded via site JS) [09:30:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P29188 and previous config saved to /var/cache/conftool/dbconfig/20220531-093057-root.json [09:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:20] _joe_: I'm still dragging out the custom deploy window. Probably won't interfere unless you are about to deploy MediaWiki changes. [09:31:58] <_joe_> tgr: no actually I missed the deploy window, I wanted to check one thing during deploys [09:33:10] <_joe_> namely, if canaries were properly restarted during the deployment and thus picked up the changes [09:33:52] kostajh: welcomesurvey probably breaks centralauth login on signup by changing the redirection URL. (I'll write a patch for that later, should be straightforward.) But it doesn't interfere otherwise. [09:34:07] _joe_: I can ping you when I am syncing the patch. [09:34:43] (03CR) 10David Caro: [C: 03+1] "Just noting that this makes eqiad ceph cluster drop the network 10.192.20.0/24, that is cloud-hosts1-codfw, that before it was allowing ac" [puppet] - 10https://gerrit.wikimedia.org/r/801380 (owner: 10Majavah) [09:34:56] (03CR) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [09:35:00] <_joe_> tgr: thanks, would appreciate [09:35:02] tgr: is that the patch we went back and forth on in T267273? [09:35:04] T267273: [arwiki] Submitting a POST on a form redirected to immediately after account creation sometimes logs user out - https://phabricator.wikimedia.org/T267273 [09:36:25] oh, right, sorry, you already had a patch for that. The redirect hook one, yeah. 
[09:37:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:37:23] Wrt the logouts, I only get it on arwiki, and only shortly after login. But it can be a normal login (not a signup), and "shortly" can mean something like 10 seconds - I don't think the race condition theory holds up. [09:39:12] (03CR) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [09:40:51] tgr: hmm, it happens on normal login? That I have not seen before [09:42:13] <_joe_> tgr: if you have the request-id for your request, we might have logs in sessionstore regarding that request [09:42:25] <_joe_> the one where you resulted as logged-out [09:43:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add clearification to comment, to help avoid mistakes using httpd::site. [puppet] - 10https://gerrit.wikimedia.org/r/797110 (owner: 10Slyngshede) [09:43:46] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:43:53] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add clearification to comment, to help avoid mistakes using httpd::site. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797110 (owner: 10Slyngshede) [09:44:28] (03CR) 10Muehlenhoff: P:aptrepo::private refactor code to allow for multiple repositories. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [09:44:41] (03CR) 10David Caro: [C: 03+1] P:ceph: cleanup firewall rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801380 (owner: 10Majavah) [09:45:35] kostajh: yeah. 
Log in without "remember me", keep refreshing the main page, look at the user toolbar (on CA autologin it's anonymous for a second, then gets replaced) [09:46:00] RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [09:46:12] (current chrome, incognito window. Although the local session shouldn't really be browser-dependent in theory.) [09:47:12] actually "remember me" does not seem relevant [09:47:15] (03CR) 10Giuseppe Lavagetto: service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [09:47:17] ACKNOWLEDGEMENT - MD RAID on ms-be2066 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T309595 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:47:21] 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309595 (10ops-monitoring-bot) [09:47:46] RECOVERY - Check systemd state on ms-be2066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:50] tgr: I don't see the anonymous toolbar if I disable JS [09:49:41] _joe_: thanks, for context, we are debugging T299193 [09:49:41] T299193: MediaWiki login failure due to race condition with session cookie - https://phabricator.wikimedia.org/T299193 [09:50:56] RECOVERY - Check no envoy runtime configuration is left persistent on idp-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:53:04] (03PS2) 10Slyngshede: Add clarification to comment, to help avoid mistakes using httpd::site. [puppet] - 10https://gerrit.wikimedia.org/r/797110 [09:53:23] (03CR) 10Slyngshede: Add clarification to comment, to help avoid mistakes using httpd::site. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797110 (owner: 10Slyngshede) [09:53:40] (03PS2) 10Alexandros Kosiaris: eventgate-analytics: Bump 2022-05-30-145633-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801409 (https://phabricator.wikimedia.org/T306181) [09:54:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] naggen2: inject # page alias for critical hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [09:55:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] etherpad: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801634 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:56:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Overall LGTM, great work. All my comments can be addressed at a later time." [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond) [09:56:37] (03CR) 10DannyS712: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/801201 (https://phabricator.wikimedia.org/T308013) (owner: 10DannyS712) [09:56:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10DannyS712) [09:57:05] (03CR) 10Volans: "The logic looks good, apart the CI issues. Also one comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [09:59:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:02:27] (03PS1) 10Marostegui: Revert "db1100: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801202 [10:03:04] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10Meno25) [10:03:26] (03CR) 10Marostegui: [C: 03+2] Revert "db1100: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801202 (owner: 10Marostegui) [10:03:38] d'oh. kostajh sorry I just realized I backported to the wrong branch :( should have used wmf-12, not wmf-13 [10:04:15] (03PS1) 10Gergő Tisza: Tombstone the old session on SessionBackend::resetId() [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/801203 (https://phabricator.wikimedia.org/T299193) [10:04:17] tgr: oh... I didn't see that either. :\ [10:05:04] (03CR) 10Gergő Tisza: [C: 03+2] Tombstone the old session on SessionBackend::resetId() [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/801203 (https://phabricator.wikimedia.org/T299193) (owner: 10Gergő Tisza) [10:07:06] (03PS1) 10Jelto: gitlab: make gitlab1003 new replica [puppet] - 10https://gerrit.wikimedia.org/r/801651 (https://phabricator.wikimedia.org/T307142) [10:08:04] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:10:16] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:12:49] (03PS4) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 [10:13:18] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35620/console" [puppet] - 10https://gerrit.wikimedia.org/r/801651 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [10:14:21] (03PS2) 10Muehlenhoff: Point idp-test to idp-test1002 [dns] - 10https://gerrit.wikimedia.org/r/801624 (https://phabricator.wikimedia.org/T308214) [10:16:49] (03CR) 10CI reject: [V: 04-1] wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond) [10:17:12] (03CR) 10Muehlenhoff: [C: 03+2] Point idp-test to idp-test1002 [dns] - 10https://gerrit.wikimedia.org/r/801624 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [10:17:28] (03CR) 10Jbond: "Running PCC (https://puppet-compiler.wmflabs.org/pcc-worker1002/35621/) but think i have addresses all comments" [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond) [10:19:10] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks, merging." [puppet] - 10https://gerrit.wikimedia.org/r/801201 (https://phabricator.wikimedia.org/T308013) (owner: 10DannyS712) [10:20:02] (03PS1) 10Jelto: wikimedia.org: make gitlab1003 the new gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/801652 (https://phabricator.wikimedia.org/T307142) [10:21:04] (03CR) 10Cathal Mooney: [C: 03+1] "I'm not an authority on puppetcode, but the intent seems clear and makes sense to me so +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/801380 (owner: 10Majavah) [10:22:51] (03Merged) 10jenkins-bot: Tombstone the old session on SessionBackend::resetId() [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/801203 (https://phabricator.wikimedia.org/T299193) (owner: 10Gergő Tisza) [10:23:20] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:23:58] (03PS2) 10Muehlenhoff: python_deploy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801641 (https://phabricator.wikimedia.org/T308013) [10:25:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-analytics: Bump 2022-05-30-145633-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801409 (https://phabricator.wikimedia.org/T306181) (owner: 10Alexandros Kosiaris) [10:28:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:29:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:25] (03PS5) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 [10:29:37] <_joe_> tgr, kostajh your patch is already deployed to mw-on-k8s [10:29:44] <_joe_> you can test it there if you like :) [10:30:00] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:30:05] I got myself confused, wmf.13 *is* the right place [10:30:17] the patch is just not having any effect [10:30:48] tgr: I need to step away for a bit [10:31:25] !log adding restbase2027-b to cassandra cluster [10:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:31:40] RECOVERY - cassandra-b SSL 10.192.48.183:7001 on restbase2027 is OK: SSL OK - Certificate restbase2027-b valid until 2024-05-29 16:33:48 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:31:43] kostajh: ack, thanks for your help [10:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:46] (03Merged) 10jenkins-bot: eventgate-analytics: Bump 2022-05-30-145633-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801409 (https://phabricator.wikimedia.org/T306181) (owner: 10Alexandros Kosiaris) [10:32:22] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [10:32:33] (03PS4) 10Jbond: naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) [10:32:34] RECOVERY - cassandra-b service on restbase2027 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:34:51] (03PS1) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [10:34:54] (03PS1) 10Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [10:35:31] (03CR) 10CI reject: [V: 04-1] check: Split common functionality to a WMFMetrics class 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [10:35:45] (03CR) 10Giuseppe Lavagetto: "Patch LGMT but there is an issue with dependencies on stretch IIRC." [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801646 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [10:38:18] (03PS2) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [10:38:52] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:40:56] (03PS2) 10Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [10:41:18] (03PS3) 10Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [10:41:32] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:48] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:49] (03PS3) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [10:41:51] (03CR) 10CI reject: [V: 04-1] check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [10:42:29] (03CR) 10CI reject: [V: 04-1] Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 (owner: 10Jcrespo) [10:43:16] PROBLEM - k8s 
API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:43:47] (03PS4) 10Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [10:44:22] (03CR) 10CI reject: [V: 04-1] check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [10:48:01] (03PS2) 10Hnowlan: install_server: add reimage role for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) [10:48:44] (03CR) 10CI reject: [V: 04-1] install_server: add reimage role for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [10:49:11] (03PS5) 10Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [10:49:43] (03CR) 10CI reject: [V: 04-1] check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [10:49:53] (03CR) 10Hnowlan: [C: 03+2] kartotherian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801639 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:50:04] (03CR) 10Hnowlan: [C: 03+1] kartotherian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801639 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:53:16] (03CR) 10Jcrespo: "Adding Alex, which I think he wrote most of these?" 
[puppet] - 10https://gerrit.wikimedia.org/r/801631 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:53:19] 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10Volans) `ms-be2066` is back online. I've converted almost all via redfish in an automated way, but there is one bit to set the boot disk that so far... [10:55:36] (03CR) 10Volans: [C: 03+1] "LGTM, if possible test it in isolation with the existing exported resources to see the diff." [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [10:56:26] (03PS1) 10Jbond: site.pp: add netboxdb[12]002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/801658 (https://phabricator.wikimedia.org/T296452) [10:56:35] (03PS1) 10Gergő Tisza: Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/801204 [10:56:47] (03CR) 10Jbond: [C: 03+2] site.pp: add netboxdb[12]002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/801658 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:57:03] (03CR) 10Gergő Tisza: [C: 03+2] Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/801204 (owner: 10Gergő Tisza) [10:57:29] _joe_: I'm about to sync, should I do anything special? 
[10:57:50] <_joe_> tgr: no just tell me when the canaries are done [10:58:45] _joe_: done [10:58:57] ..actually not done [10:59:07] now it's done [10:59:39] (sorry there's a "Finished Canary Endpoint Check Complete" log line, but there are more canary checks after that) [11:00:00] <_joe_> tgr: yeah that was enough, clearly there's a regression in scap I have to address [11:00:04] <_joe_> but not an issue for you right now [11:01:41] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.13/includes/session: Backport: [[gerrit:799388|Tombstone the old session on SessionBackend::resetId() (T299193)]] (duration: 03m 12s) [11:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:49] T299193: MediaWiki login failure due to race condition with session cookie - https://phabricator.wikimedia.org/T299193 [11:01:50] maybe they are restarted as part of the generic php-fpm-restarts? that only finished now [11:01:52] tgr: back. you're reverting it? [11:01:59] <_joe_> tgr: yes [11:02:03] kostajh: no, just synced [11:02:18] I'm reverting on wmf.12 because that was pointless [11:02:28] ack [11:03:12] (PS6) Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [11:03:16] I'll see if I can get it to actually work. If it does not do anything, probably best to revert it on master as well. Not urgent though. [11:05:12] SRE, Infrastructure-Foundations, netops: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524 (cmooney) I agree @jbond it would be useful to have more granular detail. When we don't have a "match" on the dhcp snippet then we end up with a log like this: ` DHCPDISCOVER fr... 
[11:06:35] (03PS2) 10Muehlenhoff: kartotherian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801639 (https://phabricator.wikimedia.org/T308013) [11:09:21] (03PS7) 10Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [11:11:28] (03PS3) 10Hnowlan: install_server: add reimage role for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) [11:11:30] (03PS4) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [11:11:49] (03PS5) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [11:12:20] (03PS1) 10Jbond: netboxdb: add hiera and dhcpd config for netboxdb [puppet] - 10https://gerrit.wikimedia.org/r/801659 (https://phabricator.wikimedia.org/T296452) [11:13:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35622/console" [puppet] - 10https://gerrit.wikimedia.org/r/801659 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:14:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] netboxdb: add hiera and dhcpd config for netboxdb [puppet] - 10https://gerrit.wikimedia.org/r/801659 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:14:41] (03PS6) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [11:16:38] (03CR) 10Muehlenhoff: P:aptrepo::private refactor code to allow for multiple repositories. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [11:17:05] (03CR) 10Muehlenhoff: [C: 03+2] kartotherian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801639 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:17:15] (03Merged) 10jenkins-bot: Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/801204 (owner: 10Gergő Tisza) [11:19:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) @nskaggs / @dcaro, just an observation I'd missed before on this task: ` cloudnet1005 C8 U37 Cableid 20220119, 20220120 Port 1, 2 (cloud... [11:19:54] (03CR) 10Muehlenhoff: [C: 03+2] python_deploy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801641 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:21:57] (03PS1) 10Jbond: dhcpd: add mac address for netboxdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/801660 (https://phabricator.wikimedia.org/T296452) [11:22:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:26] (03CR) 10Jbond: [C: 03+2] dhcpd: add mac address for netboxdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/801660 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:22:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:22:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply 
[11:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:01] (03PS8) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 [11:24:12] (03CR) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [11:24:21] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:24:54] (03CR) 10CI reject: [V: 04-1] P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [11:25:23] (03CR) 10Majavah: [C: 03+1] "seems to work locally" [puppet] - 10https://gerrit.wikimedia.org/r/799982 (owner: 10Jbond) [11:26:18] (03PS9) 10Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. 
[puppet] - 10https://gerrit.wikimedia.org/r/799340 [11:28:51] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:30:49] !log jnuche@deploy1002 install-world aborted: (duration: 00m 01s) [11:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:09] !log jnuche@deploy1002 Installing scap version "4.8.0" [11:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] !log jnuche@deploy1002 Installation of scap version "4.8.0" completed for 524 hosts [11:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:33] ^^^ please ignore the scap messages, they are tests [11:33:25] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:34:59] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [11:37:04] (03CR) 10Slyngshede: [C: 03+2] P:aptrepo::private refactor code to allow for multiple repositories. 
[puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [11:37:15] PROBLEM - puppet last run on kubetcd2004 is CRITICAL: CRITICAL: Puppet has been disabled for 605171 seconds, message: reboot puppet master/dbs, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:38:01] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:38:47] (03PS7) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [11:38:54] I'm re-enabling puppet on kubetcd2004, seems that server was unreachable when puppet was re-enabled after the puppetmaster/puppetdb reboots [11:39:02] (03PS1) 10Ladsgroup: Migrate zhwiki to read new for templatelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801661 (https://phabricator.wikimedia.org/T306673) [11:39:53] 10SRE, 10SectionTranslation, 10Language-Team (Language-2022-April-June): Deploy cxserver db password in private puppet repository - https://phabricator.wikimedia.org/T309486 (10KartikMistry) [11:40:06] (03PS8) 10Jcrespo: check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) [11:41:32] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [11:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:52] 10SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:41:56] 10SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (10MoritzMuehlenhoff) I don't think one big meta task will work out, it'll show up in too many 
workboards (even if e.g. some bits are done) and there's also the issue that a task can only have one assignee. So I... [11:42:12] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [11:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:20] (03CR) 10Jbond: naggen: only apply alias injection for hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [11:43:31] RECOVERY - puppet last run on kubetcd2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:44:13] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [11:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:21] (03PS1) 10Elukey: Move the ml-staging cluster under ml-serve's definition [puppet] - 10https://gerrit.wikimedia.org/r/801662 (https://phabricator.wikimedia.org/T302195) [11:44:46] (JobUnavailable) firing: Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:44:51] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:45:08] (03CR) 10Jbond: [C: 03+2] naggen: only apply alias injection for hosts [puppet] - 10https://gerrit.wikimedia.org/r/801648 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [11:45:33] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [11:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:13] (03PS1) 10KartikMistry: Update cxserver to 2022-05-31-111430-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) [11:49:00] (03PS1) 10Kevin Bazira: ml-services: add glwiki & nlwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/801664 (https://phabricator.wikimedia.org/T307418) [11:49:20] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [11:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:45] (JobUnavailable) resolved: Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:49:59] (03PS1) 10Majavah: move more nrpe checks to nrpe::plugin and sudo_user [puppet] - 10https://gerrit.wikimedia.org/r/801665 [11:50:04] (03CR) 10Jcrespo: [C: 03+2] check: Split common functionality to a WMFMetrics class [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801657 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [11:50:37] (03CR) 10Jcrespo: [C: 03+2] Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 (owner: 10Jcrespo) [11:50:39] (03PS1) 10Slyngshede: P::aptrepo::private, fix parameters for 
aptrepo::repo [puppet] - 10https://gerrit.wikimedia.org/r/801686 [11:50:41] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [11:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:50] (03PS8) 10Jcrespo: Prepare for release 0.8.2 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/801656 [11:52:35] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35623/console" [puppet] - 10https://gerrit.wikimedia.org/r/801665 (owner: 10Majavah) [11:52:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:42] (03CR) 10Elukey: [C: 03+2] ml-services: add glwiki & nlwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/801664 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [11:56:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [11:56:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [11:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:29] 10SRE, 10Infrastructure-Foundations, 10netops: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524 (10jbond) Thanks for looking at this @Volans @cmooney > Because that's a valid hostname in our DNS it would have just used that IP. So not sure how to "prevent" this. Doh! > It... 
[11:57:58] (03CR) 10Jbond: [C: 03+2] Rakefie: Add URI.escape monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/799982 (owner: 10Jbond) [11:59:17] kart_: do you need any help on merging patches in puppet or putting the password in the private repo? [11:59:26] (for the new cx db) [11:59:44] Amir1: https://phabricator.wikimedia.org/T309486 - since I can't do it myself :) [11:59:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801686 (owner: 10Slyngshede) [12:00:02] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:17] I will do it [12:00:27] (03CR) 10Slyngshede: [C: 03+2] P::aptrepo::private, fix parameters for aptrepo::repo [puppet] - 10https://gerrit.wikimedia.org/r/801686 (owner: 10Slyngshede) [12:00:34] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:00:35] Thanks! You know where to get the password too :) [12:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:56] where I put it? 
:P [12:01:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801686 (owner: 10Slyngshede) [12:01:51] (03CR) 10Klausman: [C: 03+1] Move the ml-staging cluster under ml-serve's definition [puppet] - 10https://gerrit.wikimedia.org/r/801662 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [12:02:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:02:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:02:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:31] PROBLEM - netbox HTTPS on netbox1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 312 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Netbox [12:05:52] (03PS1) 10Jbond: spdx: correct typo support vs Supoport [puppet] - 10https://gerrit.wikimedia.org/r/801689 [12:06:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:06:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:50] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/801689 (owner: 10Jbond) [12:07:59] RECOVERY - k8s API server requests latencies on 
ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:08:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host idp2002.wikimedia.org [12:08:41] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:13] (03PS1) 10Slyngshede: aptrepo::repo syntax fixes. [puppet] - 10https://gerrit.wikimedia.org/r/801690 [12:12:28] (03CR) 10Slyngshede: [C: 03+2] aptrepo::repo syntax fixes. [puppet] - 10https://gerrit.wikimedia.org/r/801690 (owner: 10Slyngshede) [12:13:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:13:16] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache idp2002.wikimedia.org on all recursors [12:13:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp2002.wikimedia.org on all recursors [12:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:30] (03PS1) 10Alexandros Kosiaris: scap: Allow rsync from analytics hosts too [puppet] - 10https://gerrit.wikimedia.org/r/801691 (https://phabricator.wikimedia.org/T307081) [12:17:19] (03PS1) 10Volans: sre.swift.convert-ssds: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) [12:18:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/801691 (https://phabricator.wikimedia.org/T307081) (owner: 10Alexandros Kosiaris) [12:18:54] (03PS1) 10Giuseppe Lavagetto: Revert "mediawiki_canaries: disable opcache revalidation" [puppet] - 
10https://gerrit.wikimedia.org/r/801670 [12:19:35] (03PS1) 10Jbond: site.pp: promote netboxdb[12]002 hosts to netbox::db [puppet] - 10https://gerrit.wikimedia.org/r/801694 (https://phabricator.wikimedia.org/T296452) [12:19:45] (03CR) 10CI reject: [V: 04-1] Revert "mediawiki_canaries: disable opcache revalidation" [puppet] - 10https://gerrit.wikimedia.org/r/801670 (owner: 10Giuseppe Lavagetto) [12:19:52] (03CR) 10Jbond: [C: 03+1] spdx: correct typo support vs Supoport [puppet] - 10https://gerrit.wikimedia.org/r/801689 (owner: 10Jbond) [12:19:57] (03CR) 10Jbond: [C: 03+2] site.pp: promote netboxdb[12]002 hosts to netbox::db [puppet] - 10https://gerrit.wikimedia.org/r/801694 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:20:31] 10SRE, 10Infrastructure-Foundations, 10netops: Cannot verify NTP status asw1-b12-drmrs - https://phabricator.wikimedia.org/T305840 (10cmooney) 05Open→03Resolved a:03cmooney After a bit of back-and-forth with Juniper they eventually suggests just killing the ntpd process from a root shell. Which has do... [12:20:59] (03CR) 10Jbond: [C: 03+2] spdx: correct typo support vs Supoport [puppet] - 10https://gerrit.wikimedia.org/r/801689 (owner: 10Jbond) [12:21:04] <_joe_> jouncebot: next [12:21:04] In 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1300) [12:21:09] (03CR) 10Volans: "Initial version of a potential cookbook to automate the process." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [12:21:24] (03CR) 10Cathal Mooney: [C: 03+2] Add new per-rack cloudsw subnets for e4 and f4 to networks data [puppet] - 10https://gerrit.wikimedia.org/r/800730 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:21:57] (03PS2) 10Giuseppe Lavagetto: Revert "mediawiki_canaries: disable opcache revalidation" [puppet] - 10https://gerrit.wikimedia.org/r/801670 [12:22:43] (03PS2) 10KartikMistry: Update cxserver to 2022-05-31-111430-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) [12:22:45] (03CR) 10Cathal Mooney: [C: 03+2] Install server changes to support new subnets cloud racks c8 and d5 [puppet] - 10https://gerrit.wikimedia.org/r/800731 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:22:52] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [12:23:24] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [12:24:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "mediawiki_canaries: disable opcache revalidation" [puppet] - 10https://gerrit.wikimedia.org/r/801670 (owner: 10Giuseppe Lavagetto) [12:24:52] (03PS1) 10MVernon: swift: remove failed node ms-be1059, mark a dead drive failed [puppet] - 10https://gerrit.wikimedia.org/r/801695 (https://phabricator.wikimedia.org/T307667) [12:25:25] <_joe_> topranks: can I merge your patch too? [12:25:44] (03CR) 10CI reject: [V: 04-1] swift: remove failed node ms-be1059, mark a dead drive failed [puppet] - 10https://gerrit.wikimedia.org/r/801695 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [12:25:46] _joe_: sorry yep please do [12:25:49] thanks [12:26:02] <_joe_> topranks: done! 
[12:26:13] (03PS3) 10KartikMistry: Update cxserver to 2022-05-31-111430-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) [12:27:21] (03PS2) 10MVernon: swift: remove failed node ms-be1059, mark a dead drive failed [puppet] - 10https://gerrit.wikimedia.org/r/801695 (https://phabricator.wikimedia.org/T307667) [12:28:02] (03PS1) 10Slyngshede: P:aptrepo::private, incorrect variable name in template. [puppet] - 10https://gerrit.wikimedia.org/r/801697 [12:29:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] scap: Allow rsync from analytics hosts too [puppet] - 10https://gerrit.wikimedia.org/r/801691 (https://phabricator.wikimedia.org/T307081) (owner: 10Alexandros Kosiaris) [12:29:52] 10SRE, 10SectionTranslation, 10Language-Team (Language-2022-April-June): Deploy cxserver db password in private puppet repository - https://phabricator.wikimedia.org/T309486 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I did it to reduce the burden on clinic duty. The non-db config is wrong though bu... [12:30:05] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Addshore) I believe that this will do the right thing! [12:30:08] (03CR) 10Addshore: [C: 03+1] maintenance::wikidata: Update cron with lb and lb-pool params [puppet] - 10https://gerrit.wikimedia.org/r/797077 (https://phabricator.wikimedia.org/T238751) (owner: 10Giuseppe Lavagetto) [12:31:02] PROBLEM - Check systemd state on ml-serve1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:10] (03CR) 10Slyngshede: [C: 03+2] P:aptrepo::private, incorrect variable name in template. 
[puppet] - 10https://gerrit.wikimedia.org/r/801697 (owner: 10Slyngshede) [12:32:06] (03PS2) 10Volans: sre.swift.convert-ssds: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) [12:32:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp2002.wikimedia.org [12:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:40] (03CR) 10Volans: "added polling for power off" [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [12:33:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:33:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:35:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:08] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:36:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:36:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/801695 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [12:37:24] jouncebot: nowandnext [12:37:24] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [12:37:24] In 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1300) [12:37:30] awesome [12:37:43] (03CR) 10MVernon: [C: 03+2] swift: remove failed node ms-be1059, mark a dead drive failed [puppet] - 10https://gerrit.wikimedia.org/r/801695 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [12:37:45] (03CR) 10Ladsgroup: [C: 03+2] Migrate zhwiki to read new for templatelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801661 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [12:37:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:57] (03CR) 10Elukey: [C: 03+2] Move the ml-staging cluster under ml-serve's definition [puppet] - 10https://gerrit.wikimedia.org/r/801662 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [12:38:27] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: migrate update_graphite_index cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779022 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:38:29] (03Merged) 10jenkins-bot: Migrate zhwiki to read new for templatelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801661 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [12:39:06] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:39:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:43] (03PS1) 10Slyngshede: aptrepo::repo add missing distribution file [puppet] - 10https://gerrit.wikimedia.org/r/801700 [12:41:13] (03CR) 10Slyngshede: [C: 03+2] aptrepo::repo add missing distribution file [puppet] - 10https://gerrit.wikimedia.org/r/801700 (owner: 10Slyngshede) [12:41:15] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: migrate sync-icinga-state cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:41:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:41:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:41:23] (03PS3) 10Filippo Giunchedi: icinga: migrate sync-icinga-state cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:38] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:801661|Migrate zhwiki to read new for templatelinks (T306673)]] (duration: 03m 10s) [12:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:44] T306673: Turn on read new for templatelinks on beta and production - 
https://phabricator.wikimedia.org/T306673 [12:44:14] (03PS2) 10Ori: service::docker: refresh service when config file is changed [puppet] - 10https://gerrit.wikimedia.org/r/799420 [12:44:25] (03CR) 10Ori: service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [12:44:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:45:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:45] (03PS1) 10Slyngshede: profile::aptrepo::private remove duplicate file. 
[puppet] - 10https://gerrit.wikimedia.org/r/801702 [12:47:09] (03CR) 10David Caro: "LGTM, though will give an opportunity for someone that knows better depoyment-prep to have a look" [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah) [12:47:37] (03PS4) 10KartikMistry: Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) [12:48:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:48:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T309311)', diff saved to https://phabricator.wikimedia.org/P29193 and previous config saved to /var/cache/conftool/dbconfig/20220531-124807-ladsgroup.json [12:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:16] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [12:48:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:45] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799845 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah) [12:48:47] !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts [12:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:51] !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying 
v2.10.4-wmf6 to new hosts (duration: 01m 04s) [12:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:11] (03CR) 10Slyngshede: [C: 03+2] profile::aptrepo::private remove duplicate file. [puppet] - 10https://gerrit.wikimedia.org/r/801702 (owner: 10Slyngshede) [12:51:06] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: remove absented sync-icinga-state cron [puppet] - 10https://gerrit.wikimedia.org/r/780672 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:51:18] (03PS2) 10Filippo Giunchedi: icinga: remove absented sync-icinga-state cron [puppet] - 10https://gerrit.wikimedia.org/r/780672 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:51:35] RECOVERY - cassandra-b CQL 10.192.48.183:9042 on restbase2027 is OK: TCP OK - 0.033 second response time on 10.192.48.183 port 9042 https://phabricator.wikimedia.org/T93886 [12:54:00] zabe: finally got around merging your patches for icinga/graphite timer conversion, thank you again [12:57:16] (03CR) 10Tchanders: [C: 03+1] Assign similareditors right to the checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [12:58:17] PROBLEM - Check systemd state on kubernetes1013 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:34] (03CR) 10MVernon: "Thanks for making headway on this, much appreciated :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [12:59:29] RECOVERY - Check systemd state on ml-serve1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:42] !log killed kowiki's refreshLinkRecommendations.php (T299021) [12:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:50] T299021: Shorten running time of 
refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1300). [13:00:04] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] (03PS1) 10David Caro: codfw1dev,wmcs: Add labtest/wmcs-roots to the admin groups [puppet] - 10https://gerrit.wikimedia.org/r/801704 [13:01:01] (03PS1) 10Slyngshede: aptrepo::repo move validation command to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/801705 [13:01:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "If you want this merged instead of I5e13f75d2e (since you added it to the backport window), I’d like to have some explanation why it’s bet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801403 (https://phabricator.wikimedia.org/T309544) (owner: 10Stang) [13:01:43] I can’t deploy today, sorry [13:02:25] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:03:03] oh sorry I didn't see that patch at that time [13:03:11] ah, okay ^^ [13:03:28] I was very surprised when I was done uploading my change and saw that the Phabricator task had already received a gerritbot comment in the meantime :D [13:03:31] you were very fast! 
[13:03:56] (03Abandoned) 10Stang: enwiki: Regenerate inconsistent logo-1x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801403 (https://phabricator.wikimedia.org/T309544) (owner: 10Stang) [13:04:46] I can deploy today [13:04:52] (I would also lean towards leaving my Gerrit change open for a few more days, changing the enwiki.png feels like a big change ^^) [13:04:57] taavi: thanks [13:05:43] (03PS2) 10Stang: zhwiktionary: Create namespace "Thesaurus" and "Citations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801420 (https://phabricator.wikimedia.org/T309564) [13:06:53] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:07:01] (03CR) 10Majavah: [C: 03+2] zhwiktionary: Create namespace "Thesaurus" and "Citations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801420 (https://phabricator.wikimedia.org/T309564) (owner: 10Stang) [13:07:49] (03Merged) 10jenkins-bot: zhwiktionary: Create namespace "Thesaurus" and "Citations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801420 (https://phabricator.wikimedia.org/T309564) (owner: 10Stang) [13:07:53] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:08:19] koi: please test on mwdebug1001 [13:09:17] looking [13:09:20] (03PS1) 10Jbond: netbox: Add discovery name to django allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/801713 (https://phabricator.wikimedia.org/T296452) [13:10:15] RECOVERY - netbox HTTPS on netbox1002 is OK: HTTP OK: HTTP/1.1 302 Found - 450 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Netbox [13:10:33] LGTM! 
[13:10:59] ok, syncing [13:11:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801705 (owner: 10Slyngshede) [13:12:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35624/console" [puppet] - 10https://gerrit.wikimedia.org/r/801713 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:12:44] hmm, scap is taking a while for php-fpm-restarts today [13:13:41] (03PS2) 10Jbond: netbox: Add discovery name to django allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/801713 (https://phabricator.wikimedia.org/T296452) [13:13:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:00] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:801420|zhwiktionary: Create namespace "Thesaurus" and "Citations" (T309564)]] (duration: 02m 56s) [13:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:06] T309564: Create namespace "Thesaurus" and "Citations" for zhwiktionary - https://phabricator.wikimedia.org/T309564 [13:14:18] running namespaceDupes.php now [13:14:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35625/console" [puppet] - 10https://gerrit.wikimedia.org/r/801713 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:14:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:14:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:18] !log taavi@mwmaint1002 ~ $ mwscript namespaceDupes.php --wiki zhwiktionary 
--fix # T309564 [13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:31] koi: all done I think [13:15:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:46] thx! [13:15:50] (03CR) 10Ssingh: [C: 03+1] "LGTM, thank you for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/801633 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:17:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox: Add discovery name to django allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/801713 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:21:29] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1013 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:21:55] (03PS3) 10Jbond: O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791589 (https://phabricator.wikimedia.org/T296452) [13:22:12] (03PS1) 10Filippo Giunchedi: blackbox: add IRC probe module [puppet] - 10https://gerrit.wikimedia.org/r/801714 (https://phabricator.wikimedia.org/T305847) [13:22:28] (03CR) 10CI reject: [V: 04-1] O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791589 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:23:01] (03PS5) 10Jbond: services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/791588 (https://phabricator.wikimedia.org/T296452) [13:23:12] (03PS4) 10Jbond: O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791589 (https://phabricator.wikimedia.org/T296452) [13:24:42] (03CR) 10Itamar Givon: [C: 03+1] "Looks good to me, after a quick chat w. 
Adam" [puppet] - 10https://gerrit.wikimedia.org/r/797077 (https://phabricator.wikimedia.org/T238751) (owner: 10Giuseppe Lavagetto) [13:24:57] RECOVERY - Check systemd state on kubernetes1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [13:25:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [13:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T60674)', diff saved to https://phabricator.wikimedia.org/P29194 and previous config saved to /var/cache/conftool/dbconfig/20220531-132530-ladsgroup.json [13:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:40] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:25:42] (03PS2) 10Jbond: netbox: create discovery record for netbox [dns] - 10https://gerrit.wikimedia.org/r/791586 (https://phabricator.wikimedia.org/T296452) [13:26:53] (03CR) 10Jbond: [C: 03+2] services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/791588 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:26:56] (03CR) 10Jbond: [C: 03+2] O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791589 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:27:59] (03PS1) 10Jbond: Revert "O:cache::text: Move netbox to the cacheing infrastructure" [puppet] - 10https://gerrit.wikimedia.org/r/801675 [13:28:02] (03PS1) 10Jbond: Revert "services: Add DNS discovery record for netbox" 
[puppet] - 10https://gerrit.wikimedia.org/r/801676 [13:28:17] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "O:cache::text: Move netbox to the cacheing infrastructure" [puppet] - 10https://gerrit.wikimedia.org/r/801675 (owner: 10Jbond) [13:28:22] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "services: Add DNS discovery record for netbox" [puppet] - 10https://gerrit.wikimedia.org/r/801676 (owner: 10Jbond) [13:29:24] (03PS1) 10Jbond: Revert^2 "services: Add DNS discovery record for netbox" [puppet] - 10https://gerrit.wikimedia.org/r/801677 [13:29:27] (03PS1) 10Jbond: Revert "Revert "O:cache::text: Move netbox to the cacheing infra..." [puppet] - 10https://gerrit.wikimedia.org/r/801678 [13:31:21] (03PS2) 10Jbond: services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/801677 [13:31:36] (03PS2) 10Jbond: Revert "Revert "O:cache::text: Move netbox to the cacheing infra..." [puppet] - 10https://gerrit.wikimedia.org/r/801678 [13:32:24] (03PS3) 10Jbond: O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/801678 (https://phabricator.wikimedia.org/T296452) [13:32:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35626/console" [puppet] - 10https://gerrit.wikimedia.org/r/801678 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:32:51] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01888 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:33:19] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:33:45] * jbond looking at the puppet failurs [13:33:50] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:33:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29195 and previous config saved to /var/cache/conftool/dbconfig/20220531-133356-ladsgroup.json [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:08] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:35:01] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye [13:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:06] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye [13:37:43] (03CR) 10Jbond: [C: 03+2] services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/801677 (owner: 10Jbond) [13:37:46] (03CR) 10Jbond: [C: 03+2] O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/801678 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:38:28] !log move ml-etcd100[1-3] from drdb to plain to investigate high k8s latencies for the control plane [13:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:11] PROBLEM - k8s API server requests 
latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:41:15] PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The following units failed: session-343844.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:18] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10hoo) [13:41:29] (03CR) 10Jbond: [C: 03+2] netbox: create discovery record for netbox [dns] - 10https://gerrit.wikimedia.org/r/791586 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:42:01] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002041 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:43:02] (03PS1) 10Jbond: Revert "netbox: create discovery record for netbox" [dns] - 10https://gerrit.wikimedia.org/r/801679 [13:43:22] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "netbox: create discovery record for netbox" [dns] - 10https://gerrit.wikimedia.org/r/801679 (owner: 10Jbond) [13:43:55] (03PS1) 10Jbond: netbox: create discovery record for netbox [dns] - 10https://gerrit.wikimedia.org/r/801680 (https://phabricator.wikimedia.org/T296452) [13:44:47] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:44:55] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:46:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:25] (03PS1) 10Muehlenhoff: Add DHCP record for idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/801717 [13:50:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29197 and previous config saved to /var/cache/conftool/dbconfig/20220531-135022-ladsgroup.json [13:50:28] (03PS1) 10Jbond: netbox: configure dns discovery services to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/801718 (https://phabricator.wikimedia.org/T296452) [13:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:30] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:51:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T309311)', diff saved to https://phabricator.wikimedia.org/P29198 and previous config saved to /var/cache/conftool/dbconfig/20220531-135105-ladsgroup.json [13:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:11] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:51:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35627/console" [puppet] - 10https://gerrit.wikimedia.org/r/801718 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) 
[13:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T60674)', diff saved to https://phabricator.wikimedia.org/P29199 and previous config saved to /var/cache/conftool/dbconfig/20220531-135157-ladsgroup.json [13:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:03] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:52:43] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1013 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:57:25] RECOVERY - cassandra-c service on restbase2027 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:57:49] RECOVERY - cassandra-c SSL 10.192.48.184:7001 on restbase2027 is OK: SSL OK - Certificate restbase2027-c valid until 2024-05-29 16:33:51 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:57:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox: configure dns discovery services to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/801718 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:58:57] (03PS2) 10Jbond: netbox: create discovery record for netbox [dns] - 10https://gerrit.wikimedia.org/r/801680 (https://phabricator.wikimedia.org/T296452) [14:01:03] (03CR) 10Jbond: [C: 03+2] netbox: create discovery record for netbox [dns] - 10https://gerrit.wikimedia.org/r/801680 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:03:06] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye [14:03:08] !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=netbox,name=eqiad [14:03:11] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye 
- https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela... [14:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] PROBLEM - Check systemd state on ms-be1061 is CRITICAL: CRITICAL - degraded: The following units failed: session-343748.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29200 and previous config saved to /var/cache/conftool/dbconfig/20220531-140528-ladsgroup.json [14:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29201 and previous config saved to /var/cache/conftool/dbconfig/20220531-140611-ladsgroup.json [14:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P29202 and previous config saved to /var/cache/conftool/dbconfig/20220531-140702-ladsgroup.json [14:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:12:09] RECOVERY - k8s API server requests latencies on 
ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:13:51] (PS1) Daniel Kinzler: EXPERIMENT: allow DB config reload [mediawiki-config] - https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) [14:14:52] (CR) CI reject: [V: -1] EXPERIMENT: allow DB config reload [mediawiki-config] - https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) (owner: Daniel Kinzler) [14:16:01] (CirrusSearchHighOldGCFrequency) firing: (4) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:17:40] !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts [14:17:44] (PS1) Gergő Tisza: Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/801683 (https://phabricator.wikimedia.org/T299193) [14:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:19] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1006.wikimedia.org with OS bullseye [14:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:25] SRE, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye [14:18:28] jouncebot: next [14:18:28] In 1 hour(s) and 41 minute(s): Puppet request window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1600) [14:19:04] !log doing an emergency revert for T309616 [14:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:14] T309616: Cross-wiki session loss on Wikimedia wikis - https://phabricator.wikimedia.org/T309616 [14:19:35] (CR) Gergő Tisza: [C: +2] Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/801683 (https://phabricator.wikimedia.org/T299193) (owner: Gergő Tisza) [14:19:56] !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts (duration: 02m 15s) [14:19:59] !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts [14:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:32] !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts (duration: 00m 32s) [14:20:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29204 and previous config saved to /var/cache/conftool/dbconfig/20220531-142033-ladsgroup.json [14:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29205 and previous config saved to /var/cache/conftool/dbconfig/20220531-142116-ladsgroup.json [14:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:25] !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts [14:21:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P29206 and previous config saved to /var/cache/conftool/dbconfig/20220531-142207-ladsgroup.json [14:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:17] (PS1) Filippo Giunchedi: ldap-corp: disable paging [puppet] - https://gerrit.wikimedia.org/r/801723 (https://phabricator.wikimedia.org/T244792) [14:23:17] !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts (duration: 01m 52s) [14:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:23] !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts [14:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:24] !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts (duration: 01m 01s) [14:24:27] !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts [14:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:19] !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.10.4-wmf6 to new hosts (duration: 00m 52s) [14:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:44] (PS1) Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/801684 (https://phabricator.wikimedia.org/T309617) [14:25:53] (PS1) Ladsgroup: wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/801685 (https://phabricator.wikimedia.org/T309617) [14:25:58] (PS2) Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - 
https://gerrit.wikimedia.org/r/801684 (https://phabricator.wikimedia.org/T309617) [14:26:06] (CR) CI reject: [V: -1] wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/801685 (https://phabricator.wikimedia.org/T309617) (owner: Ladsgroup) [14:27:14] (PS2) Ladsgroup: wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/801685 (https://phabricator.wikimedia.org/T309617) [14:33:22] (PS1) Jbond: CONTRIBUTORS: Add Marius Hoch [puppet] - https://gerrit.wikimedia.org/r/801725 (https://phabricator.wikimedia.org/T308013) [14:34:43] (CR) Jbond: [C: +2] CONTRIBUTORS: Add Marius Hoch [puppet] - https://gerrit.wikimedia.org/r/801725 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [14:35:22] (Merged) jenkins-bot: Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/801683 (https://phabricator.wikimedia.org/T299193) (owner: Gergő Tisza) [14:35:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29207 and previous config saved to /var/cache/conftool/dbconfig/20220531-143538-ladsgroup.json [14:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:46] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:36:10] (PS1) Hnowlan: Add missing parameter to CalledProcessError [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/801728 [14:36:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T309311)', diff saved to https://phabricator.wikimedia.org/P29208 and previous config saved to /var/cache/conftool/dbconfig/20220531-143621-ladsgroup.json [14:36:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1171.eqiad.wmnet with reason: Maintenance [14:36:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:28] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:37:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T60674)', diff saved to https://phabricator.wikimedia.org/P29209 and previous config saved to /var/cache/conftool/dbconfig/20220531-143712-ladsgroup.json [14:37:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:37:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29210 and previous config saved to /var/cache/conftool/dbconfig/20220531-143716-ladsgroup.json [14:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T60674)', diff saved to https://phabricator.wikimedia.org/P29211 and previous config saved to 
/var/cache/conftool/dbconfig/20220531-143720-ladsgroup.json [14:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:26] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:10] (PS1) David Caro: Fix spelling errors [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/801730 [14:41:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:42:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:42:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:43:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:44:47] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:45:47] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:46:22] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1006.wikimedia.org with OS bullseye [14:46:55] SRE, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors: - 
cloudela... [14:47:05] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.13/includes/session: Backport: [[gerrit:801683|Revert "Tombstone the old session on SessionBackend::resetId()" (T299193 T309616)]] (duration: 03m 08s) [14:47:05] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:24] (CR) Samtar: "https://puppet-compiler.wmflabs.org/pcc-worker1001/35628/" [puppet] - https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: Samtar) [14:49:05] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:49:31] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:09] RECOVERY - Check systemd state on netbox2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29212 and previous config saved to /var/cache/conftool/dbconfig/20220531-145331-ladsgroup.json [14:53:52] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:26] (CR) Jelto: [V: +1 C: +2] gitlab: make gitlab1003 new replica [puppet] - https://gerrit.wikimedia.org/r/801651 
(https://phabricator.wikimedia.org/T307142) (owner: Jelto) [14:56:09] (PS1) David Caro: wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/801732 [15:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T60674)', diff saved to https://phabricator.wikimedia.org/P29213 and previous config saved to /var/cache/conftool/dbconfig/20220531-150411-ladsgroup.json [15:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:19] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:08:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29214 and previous config saved to /var/cache/conftool/dbconfig/20220531-150836-ladsgroup.json [15:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:18] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:11:45] (JobUnavailable) firing: (2) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:15] ^ expected because of gitlab-replica migration [15:12:33] !log migrate gitlab-replica to gitlab1003 [15:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [15:15:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [15:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T309311)', diff saved to https://phabricator.wikimedia.org/P29215 and previous config saved to /var/cache/conftool/dbconfig/20220531-151515-ladsgroup.json [15:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:24] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:16:45] (JobUnavailable) firing: (2) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P29216 and previous config saved to 
/var/cache/conftool/dbconfig/20220531-151916-ladsgroup.json [15:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29217 and previous config saved to /var/cache/conftool/dbconfig/20220531-152341-ladsgroup.json [15:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:50] (PS2) David Caro: wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/801732 [15:30:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T309311)', diff saved to https://phabricator.wikimedia.org/P29218 and previous config saved to /var/cache/conftool/dbconfig/20220531-153053-ladsgroup.json [15:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:00] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:31:10] PROBLEM - Check systemd state on netboxdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@13-main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P29219 and previous config saved to /var/cache/conftool/dbconfig/20220531-153422-ladsgroup.json [15:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:43] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (akosiaris) Didn't work btw, turns out that eventgate also needs a service-runner bump. PR at https://... 
[15:34:58] (PS1) Jcrespo: dbbackups::check: Add enabled flag to have a passive host on codfw [puppet] - https://gerrit.wikimedia.org/r/801741 (https://phabricator.wikimedia.org/T283017) [15:36:12] SRE, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking) ^^ We've had 2 different hosts fail to reimage in the same way. Per IRC conversation with Infrastructure Foundations, "when you hit this kind of issues your best bet... [15:37:58] RECOVERY - Check systemd state on netboxdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29220 and previous config saved to /var/cache/conftool/dbconfig/20220531-153846-ladsgroup.json [15:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:52] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:39:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 5%: Maint done', diff saved to https://phabricator.wikimedia.org/P29221 and previous config saved to /var/cache/conftool/dbconfig/20220531-153859-ladsgroup.json [15:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:40:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T307525)', diff 
saved to https://phabricator.wikimedia.org/P29222 and previous config saved to /var/cache/conftool/dbconfig/20220531-154025-ladsgroup.json [15:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:41] (PS1) Ahmon Dancy: Revert "scap.cfg: Enable rsync_cdbs in beta" [puppet] - https://gerrit.wikimedia.org/r/801746 [15:41:52] (PS1) Elukey: Set cluster group to ml-serve for ml-staging control plane nodes [puppet] - https://gerrit.wikimedia.org/r/801742 (https://phabricator.wikimedia.org/T302195) [15:42:05] (PS2) Ahmon Dancy: Revert "scap.cfg: Enable rsync_cdbs in beta" [puppet] - https://gerrit.wikimedia.org/r/801746 (https://phabricator.wikimedia.org/T297326) [15:42:23] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) Ah sorry about that, should have realized. Docs here: https://wikitech.wikimedia.org/wiki/... 
[15:43:11] (CR) Elukey: [C: +2] Set cluster group to ml-serve for ml-staging control plane nodes [puppet] - https://gerrit.wikimedia.org/r/801742 (https://phabricator.wikimedia.org/T302195) (owner: Elukey) [15:44:50] PROBLEM - Check systemd state on netboxdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@13-main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:56] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:57] SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn) Open→Declined [15:45:26] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) Hm, I think we stopped using the github commit sha to install, and instead rely on NPM lik... 
[15:45:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P29223 and previous config saved to /var/cache/conftool/dbconfig/20220531-154558-ladsgroup.json [15:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:24] SRE, observability: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn) [15:49:24] RECOVERY - Check systemd state on netboxdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T60674)', diff saved to https://phabricator.wikimedia.org/P29224 and previous config saved to /var/cache/conftool/dbconfig/20220531-154928-ladsgroup.json [15:49:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [15:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [15:49:36] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:49:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [15:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [15:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:49:55] SRE, observability: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn) I have tried pinging individual IRC channels as well as individual tasks in the past. I don't know what the solution is to get attention to Icinga alerts. [15:51:44] (CR) Jcrespo: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35630/backupmon1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/801741 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo) [15:51:44] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:37] (CR) Jcrespo: [C: +2] dbbackups::check: Add enabled flag to have a passive host on codfw [puppet] - https://gerrit.wikimedia.org/r/801741 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo) [15:54:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P29225 and previous config saved to /var/cache/conftool/dbconfig/20220531-155403-ladsgroup.json [15:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:12] (PS1) Elukey: role::pki::multirootca: add settings for the ml-staging cluster [puppet] - https://gerrit.wikimedia.org/r/801744 (https://phabricator.wikimedia.org/T302195) [15:56:26] (PS2) Elukey: role::pki::multirootca: add settings for the ml-staging cluster [puppet] - https://gerrit.wikimedia.org/r/801744 (https://phabricator.wikimedia.org/T302195) [15:56:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29226 and previous config saved to /var/cache/conftool/dbconfig/20220531-155634-ladsgroup.json [15:56:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:41] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:57:00] SRE, WMF-General-or-Unknown, WMF-Legal, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (Dzahn) [15:57:39] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1006.wikimedia.org with OS bullseye [15:57:43] SRE, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye [15:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:10] (PS1) PipelineBot: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/801745 [15:58:26] (PS1) Elukey: admin_ng: set cfssl-issuer's values for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/801766 (https://phabricator.wikimedia.org/T302195) [15:59:18] PROBLEM - Disk space on an-worker1080 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/journal 395 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1080&var-datasource=eqiad+prometheus/ops [16:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1600). [16:00:04] Lucas_WMDE and tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:15] o/ [16:00:24] o/ [16:00:24] jbond: are you around to grab this? I had a last-minute meeting reschedule [16:00:59] Lucas_WMDE, tgr: worst case, I'll be with you in the second half of the window, apologies [16:01:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P29227 and previous config saved to /var/cache/conftool/dbconfig/20220531-160103-ladsgroup.json [16:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:19] ok [16:02:21] (CR) Jelto: [C: +2] wikimedia.org: make gitlab1003 the new gitlab-replica [dns] - https://gerrit.wikimedia.org/r/801652 (https://phabricator.wikimedia.org/T307142) (owner: Jelto) [16:02:53] (PS2) Jelto: wikimedia.org: make gitlab1003 the new gitlab-replica [dns] - https://gerrit.wikimedia.org/r/801652 (https://phabricator.wikimedia.org/T307142) [16:03:36] (PS2) David Caro: Fix spelling errors [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/801730 [16:03:38] (PS3) David Caro: wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/801732 [16:03:40] (PS2) David Caro: Add readme, configure script and missing modules [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/799379 [16:04:22] (CR) David Caro: [C: +1] "This will need a rebase, the wmcs branch has been rebased on top of master." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/791030 (owner: Majavah) [16:05:31] (CR) David Caro: "This will need a rebase, the wmcs branch has been rebased on top of master." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/738881 (owner: Arturo Borrero Gonzalez) [16:05:34] (CR) David Caro: "This will need a rebase, the wmcs branch has been rebased on top of master." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707 (owner: 10Arturo Borrero Gonzalez) [16:05:39] (03CR) 10David Caro: "This will need a rebase, the wmcs branch has been rebased on top of master." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/756589 (owner: 10Arturo Borrero Gonzalez) [16:05:54] (03CR) 10David Caro: "This will need a rebase, the wmcs branch has been rebased on top of master." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 (owner: 10Arturo Borrero Gonzalez) [16:05:59] (03CR) 10David Caro: "This will need a rebase, the wmcs branch has been rebased on top of master." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774459 (owner: 10Arturo Borrero Gonzalez) [16:06:17] (03PS2) 10BryanDavis: developer-portal: add developer.wikimedia.org to CDN config [puppet] - 10https://gerrit.wikimedia.org/r/800181 (https://phabricator.wikimedia.org/T297140) [16:07:24] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1006.wikimedia.org with OS bullseye [16:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:30] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye [16:07:32] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1006.wikimedia.org with OS bullseye [16:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:37] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors: - cloudelas... 
[16:07:49] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1006.wikimedia.org with OS bullseye [16:07:49] (03PS3) 10Majavah: wmcs: toolforge: grid: add a cookbook to reboot grid workers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 [16:07:54] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors: - cloudela... [16:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:24] PROBLEM - Host db2088.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:32] (03CR) 10Majavah: wmcs: toolforge: grid: add a cookbook to reboot grid workers (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 (owner: 10Majavah) [16:09:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: Maint done', diff saved to https://phabricator.wikimedia.org/P29228 and previous config saved to /var/cache/conftool/dbconfig/20220531-160907-ladsgroup.json [16:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:09] (03CR) 10CI reject: [V: 04-1] wmcs: toolforge: grid: add a cookbook to reboot grid workers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 (owner: 10Majavah) [16:11:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29229 and previous config saved to /var/cache/conftool/dbconfig/20220531-161139-ladsgroup.json [16:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:59] (03PS4) 10Majavah: wmcs: toolforge: grid: add a cookbook to reboot grid workers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 [16:12:04] (03CR) 10David Caro: 
wmcs: added missing __init__.py and relted lint fixes (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801732 (owner: 10David Caro) [16:13:00] (03CR) 10Vgutierrez: [C: 03+1] "TLS certificate looks good (includes developer.wm.org in the SAN list):" [puppet] - 10https://gerrit.wikimedia.org/r/800181 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [16:13:04] (03CR) 10Giuseppe Lavagetto: Add the master from the primary DC to the secondary DC load arrays (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [16:13:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [16:13:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [16:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T60674)', diff saved to https://phabricator.wikimedia.org/P29230 and previous config saved to /var/cache/conftool/dbconfig/20220531-161329-ladsgroup.json [16:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:37] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:14:31] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1006.wikimedia.org with OS bullseye [16:14:36] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye [16:14:40] RECOVERY - Check systemd state on netbox1002 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T309311)', diff saved to https://phabricator.wikimedia.org/P29231 and previous config saved to /var/cache/conftool/dbconfig/20220531-161609-ladsgroup.json [16:16:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:16:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:17] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:16:20] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Jgreen) [16:16:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29232 and previous config saved to /var/cache/conftool/dbconfig/20220531-161618-ladsgroup.json [16:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:45] (JobUnavailable) resolved: Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:34] PROBLEM - Disk space on an-worker1078 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/journal 
374 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1078&var-datasource=eqiad+prometheus/ops [16:18:13] !log switch gitlab-replica to gitlab1003 done [16:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:32] PROBLEM - Disk space on an-worker1090 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/journal 374 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1090&var-datasource=eqiad+prometheus/ops [16:20:12] (03PS1) 10Jgreen: nsca_frack.cfg.erb switch frbackup2001 to frbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/801768 (https://phabricator.wikimedia.org/T306842) [16:21:02] (03CR) 10David Caro: [C: 03+2] "👍" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 (owner: 10Majavah) [16:21:28] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:40] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb switch frbackup2001 to frbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/801768 (https://phabricator.wikimedia.org/T306842) (owner: 10Jgreen) [16:23:40] (03PS2) 10Majavah: sonofgridengine: grid_configurator: make the grid master a submit host [puppet] - 10https://gerrit.wikimedia.org/r/801385 (https://phabricator.wikimedia.org/T309525) [16:23:42] (03PS1) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) [16:23:45] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1006.wikimedia.org with OS bullseye [16:23:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:56] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors: - cloudelas... [16:24:06] (03PS1) 10Jbond: C:postgress:slave: add notification for lack of replication init [puppet] - 10https://gerrit.wikimedia.org/r/801771 (https://phabricator.wikimedia.org/T296452) [16:24:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P29233 and previous config saved to /var/cache/conftool/dbconfig/20220531-162411-ladsgroup.json [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:03] (03CR) 10CI reject: [V: 04-1] sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [16:25:49] (03PS2) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) [16:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29234 and previous config saved to /var/cache/conftool/dbconfig/20220531-162644-ladsgroup.json [16:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:13] (03Merged) 10jenkins-bot: wmcs: toolforge: grid: add a cookbook to reboot grid workers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 (owner: 10Majavah) [16:29:12] Lucas_WMDE, tgr: okay, looking now! 
sorry for the delay [16:29:39] (03PS3) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) [16:30:02] Lucas_WMDE: fwiw, one way to improve confidence in future apache config changes would be to write a couple of quick test cases with httpbb -- https://wikitech.wikimedia.org/wiki/Httpbb [16:30:09] happy to help get those started if you're interested [16:30:47] oh, that sounds nice [16:31:22] so would one possible workflow be to pull the puppet.git change to an mwdebug host and then run the new tests against that? [16:31:46] (03PS10) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [16:32:01] yeah exactly [16:32:10] RECOVERY - cassandra-c CQL 10.192.48.184:9042 on restbase2027 is OK: TCP OK - 0.033 second response time on 10.192.48.184 port 9042 https://phabricator.wikimedia.org/T93886 [16:32:16] (03CR) 10Jbond: [C: 03+2] C:postgress:slave: add notification for lack of replication init [puppet] - 10https://gerrit.wikimedia.org/r/801771 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:32:27] for appserver apache changes, we have a script that automates that [16:32:43] it’s a bit late for me today, but I could try to write some tests tomorrow and then come back for the Thursday puppet request window [16:32:57] also I'm not sure this works as intended: -- I think by default Files doesn't do a regex match [16:33:01] (03CR) 10JMeybohm: Replace kubeyaml with kubeconform (if available) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [16:33:07] from the documentation it looks like you want either FilesMatch or Files ~ [16:33:14] (https://httpd.apache.org/docs/trunk/mod/core.html#files) [16:33:34] ah, I must’ve missed that [16:33:39] so
it globs by default? [16:33:54] I'm not super familiar myself either, but it looks that way [16:34:08] RECOVERY - Host db2088.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.58 ms [16:34:48] rzl: Lucas_WMDE: tgr: sorry completely missed the ping for the puppet cr window yesterday. let me know if i can help [16:35:17] Lucas_WMDE: and, re tests -- I wouldn't hold up this patch for it, unless you'd prefer to wait! happy to look at those in another puppet window if you like, but you can also just add me as a reviewer any time [16:35:25] would be nice if it just supported , I only want two files after all ^^ [16:35:45] rzl: let’s postpone it to Thursday, it’s really not urgent [16:35:55] okay, sounds good [16:36:04] I’ll fix up the and try to add tests and then CC you on Gerrit [16:36:11] thanks a lot for the review so far :) [16:36:22] RECOVERY - Host db2088 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [16:36:46] cool! when you add a new test file you'll also want to update https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/httpbb.pp but I can walk you through that too [16:36:55] (03PS1) 10Gergő Tisza: Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801748 (https://phabricator.wikimedia.org/T299193) [16:37:15] jbond: no worries [16:37:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:37:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T60674)', diff saved to https://phabricator.wikimedia.org/P29235 and previous config saved to /var/cache/conftool/dbconfig/20220531-163737-ladsgroup.json
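[editor's note] The Files vs. FilesMatch point discussed above (wildcard matching by default, regex only with FilesMatch or Files ~) can be illustrated with a rough analogy. This is only a sketch: it uses Python's fnmatch as a stand-in for wildcard-style matching and re for regex matching, it does not reproduce Apache's actual matcher, and the file names and pattern are hypothetical, not taken from the patch under review.

```python
import fnmatch
import re

# Hypothetical file names, chosen only for illustration.
names = ["robots.txt", "favicon.ico", "index.php"]

# A regex alternation of the kind one might write for FilesMatch.
pattern = r"^(robots\.txt|favicon\.ico)$"


def glob_hits(pat, candidates):
    """Wildcard semantics: only * ? [] are special, so | and () are literal."""
    return [n for n in candidates if fnmatch.fnmatchcase(n, pat)]


def regex_hits(pat, candidates):
    """Regex semantics: the alternation selects either file name."""
    return [n for n in candidates if re.search(pat, n)]


print(glob_hits(pattern, names))   # the regex text matches no file name as a glob
print(regex_hits(pattern, names))  # both intended files match
```

Read as a wildcard, the regex text matches nothing, which is the failure mode suspected in the review; read as a regex, it selects exactly the two intended files.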
[16:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:48] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:38:10] rzl: ack i have put them on my list to take a look tomorrow [16:38:43] jbond: all good! no need to do anything [16:38:52] ok cool even better :) [16:39:01] thanks <3 and just ping me if that changes [16:39:25] tgr: merging yours in a sec -- anything you'd like to test manually? [16:39:28] (or anything I should) [16:39:56] I'll run a maintenance script and check logstash [16:40:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T60674)', diff saved to https://phabricator.wikimedia.org/P29236 and previous config saved to /var/cache/conftool/dbconfig/20220531-164026-ladsgroup.json [16:40:31] sounds good [16:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:56] dduvall: dancy: I have a backport needed for the train ( https://gerrit.wikimedia.org/r/c/mediawiki/core/+/801748 ), is it OK if I just merge it? [16:41:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool', diff saved to https://phabricator.wikimedia.org/P29237 and previous config saved to /var/cache/conftool/dbconfig/20220531-164122-ladsgroup.json [16:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:34] tgr: Yes. 
[16:41:45] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35631/console" [puppet] - 10https://gerrit.wikimedia.org/r/800683 (https://phabricator.wikimedia.org/T285896) (owner: 10Gergő Tisza) [16:41:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P29238 and previous config saved to /var/cache/conftool/dbconfig/20220531-164149-ladsgroup.json [16:41:55] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Cmjohnson) This task does not require DC-OPs tag, once you have moved the data, please decommission labstores and crea... [16:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:56] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [16:42:03] (03CR) 10Marostegui: [C: 03+1] wmnet: Update s7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/801685 (https://phabricator.wikimedia.org/T309617) (owner: 10Ladsgroup) [16:42:07] (03Abandoned) 10Majavah: sonofgridengine: grid_configurator: make the grid master a submit host [puppet] - 10https://gerrit.wikimedia.org/r/801385 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [16:42:10] (03CR) 10Gergő Tisza: [C: 03+2] Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801748 (https://phabricator.wikimedia.org/T299193) (owner: 10Gergő Tisza) [16:42:18] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Log output of scheduled MediaWiki maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/800683 (https://phabricator.wikimedia.org/T285896) (owner: 10Gergő Tisza) [16:42:24] PROBLEM - Disk 
space on analytics1072 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/journal 227 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [16:42:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) They finally assigned it to someone to come and replace the mother board, waiting on them to contact me to schedule the visit [16:44:08] PROBLEM - Disk space on analytics1069 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/journal 302 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1069&var-datasource=eqiad+prometheus/ops [16:44:20] PROBLEM - Host db2088 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:22] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:07] tgr: updated on mwmaint1002 [16:46:12] (03CR) 10Marostegui: [C: 03+1] mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/801684 (https://phabricator.wikimedia.org/T309617) (owner: 10Ladsgroup) [16:46:24] RECOVERY - Host db2088 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [16:47:07] thanks rzl! how do I start a systemd cronjob? simply 'systemctl start xxx.service'? 
[16:47:21] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35632/console" [puppet] - 10https://gerrit.wikimedia.org/r/800181 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [16:48:10] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] developer-portal: add developer.wikimedia.org to CDN config [puppet] - 10https://gerrit.wikimedia.org/r/800181 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [16:48:12] (03PS4) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) [16:48:14] (03PS1) 10Majavah: sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - 10https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525) [16:48:52] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Jgreen) 05Open→03Resolved [16:49:14] <_joe_> tgr: yes, but you need root rights to do so [16:49:15] rzl: I guess you need root for that. Could you start the mediawiki_job_growthexperiments-listTaskCounts.service job? [16:49:40] <_joe_> heh you found out yourself :P [16:49:53] <_joe_> rzl: I can do that [16:49:58] thanks [16:50:55] <_joe_> !log mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-listTaskCounts.service [16:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:41] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/801635 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:51:42] <_joe_> tgr: to check progress, tail -f /var/log/mediawiki/mediawiki_job_growthexperiments-listTaskCounts/syslog.log [16:51:49] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1006.wikimedia.org with OS bullseye [16:51:55] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye [16:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:43] thanks! [16:55:02] thanks _joe_ [16:55:11] 10ops-codfw, 10decommission-hardware: decommission frbackup2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T309643 (10Jgreen) [16:55:14] the output does not show up in logstash. I have no idea how fast the pipeline is supposed to be, though. 
[16:55:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P29239 and previous config saved to /var/cache/conftool/dbconfig/20220531-165531-ladsgroup.json [16:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:56] <_joe_> tgr: I'm not sure we send that to logstash from syslog [16:56:03] not sure offhand if rsyslog is supposed to pick up the config change with the puppet run, or if something needs a restart [16:56:05] <_joe_> so unless mediawiki logs to logstash itself [16:56:24] _joe_: that's what the puppet change was supposed to do [16:56:28] <_joe_> rzl: I don't think we forward the logs at stdout to logstash [16:56:40] https://gerrit.wikimedia.org/r/800683 is what tgr just added [16:56:42] <_joe_> ah I didn't run puppet on mwmaint1002 [16:56:51] I did [16:57:54] <_joe_> I doubt that patch would work [16:58:09] <_joe_> the logs we send to udp2log and logstash have a specific format IIRC [16:58:22] <_joe_> let me check rsyslog logs first [16:58:26] (03PS5) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) [16:58:28] (03PS2) 10Majavah: sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - 10https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525) [16:59:45] <_joe_> so the rule is correct [16:59:48] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10RobH) Updated firmware on cloudelastic1006 to the following 10g 21.40.25.31 to 21.85.21.92 idrac 4.00.00.00 to 5.10.10.00 bios 2.4.8 to 2.14.2 Once those were done, the... 
[16:59:55] <_joe_> in terms of syntax it is valid [17:00:03] hm, I haven't thought of that [17:00:07] <_joe_> not sure it does what you want, but it's a bit too late for me to debug it [17:00:37] the syslog lines from MediaWiki go through the Monolog logstash formatter [17:00:38] (03Merged) 10jenkins-bot: Revert "Tombstone the old session on SessionBackend::resetId()" [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801748 (https://phabricator.wikimedia.org/T299193) (owner: 10Gergő Tisza) [17:00:49] <_joe_> tgr: I would suppose the best way to achieve what you want, and also future-proof, would be to configure mediawiki in production [17:00:53] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:56] not sure about apache logs and PHP fatals which are also sent via rsyslog [17:01:05] <_joe_> to also send logs to logstash directly on maint servers [17:01:11] (03PS1) 10Catrope: doc.wikimedia.org CSP: Also allow form submissions to enwiki/wikidata [puppet] - 10https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) [17:01:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29240 and previous config saved to /var/cache/conftool/dbconfig/20220531-170113-ladsgroup.json [17:01:19] <_joe_> tgr: those are handled separately and not via that template [17:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:21] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:01:34] aren't those the other lines of the template? 
[17:01:51] anyway, your input would be welcome on T285896 / T307402 [17:01:52] T307402: Maintenance scripts should consistently log errors - https://phabricator.wikimedia.org/T307402 [17:01:52] T285896: Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 [17:01:56] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, and 2 others: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10Catrope) 05Resolved→03Open There's another issue: when you press the "Enter" key or click the Search button in these demos nothi... [17:01:58] <_joe_> tgr: ahhh wait [17:02:01] <_joe_> I found the issue [17:03:13] (03PS1) 10Majavah: sonofgridengine: grid_configurator: remove hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) [17:03:21] (03CR) 10Giuseppe Lavagetto: "This doesn't work because this rsyslog rule gets loaded with priority 40, while the rules for the jobs created by systemd::timer have prio" [puppet] - 10https://gerrit.wikimedia.org/r/800683 (https://phabricator.wikimedia.org/T285896) (owner: 10Gergő Tisza) [17:03:32] <_joe_> tgr: added a comment on the patch [17:03:56] <_joe_> rzl: basically we'd need to create this as a separate rsyslog::rule at priority 19 or lower [17:04:09] hm okay [17:04:14] I'm a bit out of my depth, rsyslog wise :) [17:04:18] <_joe_> I'll take a look :) [17:04:22] thanks! do you think it is worth fixing, or would the format be wrong for logstash anyway? [17:04:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:04] <_joe_> rzl: why, don't you love the simplicity of rule processing, the expressivity of the DSL, and the ample descriptivity of logs at anything other than the "firehose" setting? 
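[editor's note] The ordering problem _joe_ describes (a rule loaded at priority 40 never sees messages already handled by earlier rules, hence the suggestion of priority 19 or lower) can be modelled with a toy simulation. This is a sketch under stated assumptions, not rsyslog itself: it assumes lower priority numbers are processed first and that the per-job rule ends processing when it matches. The priority value 20 for the systemd::timer rule and both rule names are hypothetical, since the Gerrit comment is truncated in the log.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int  # assumption: lower numbers are loaded and evaluated first
    name: str      # hypothetical label, for illustration only
    matches: bool  # whether the rule matches the sample message
    stops: bool    # whether a match ends processing for that message


def applied(rules):
    """Return the names of the rules that get to act on one message."""
    acted = []
    for rule in sorted(rules, key=lambda r: r.priority):
        if not rule.matches:
            continue
        acted.append(rule.name)
        if rule.stops:
            break  # later rules never see the message
    return acted


# Priority 20 is an assumption; the real value is cut off in the log.
timer_rule = Rule(20, "write-job-log-file", True, True)

print(applied([timer_rule, Rule(40, "forward-to-logstash", True, False)]))
print(applied([timer_rule, Rule(19, "forward-to-logstash", True, False)]))
```

With the forwarding rule at 40 only the per-job file rule acts; moved to 19 it runs before the stopping rule, which is the shape of the fix proposed in the channel.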
[17:05:12] <_joe_> tgr: yeah I think it's worth a try [17:05:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:05:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:05:32] <_joe_> but I'm not sure I can do it today or tomorrow - then I'm off for some time [17:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:40] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking) [17:05:50] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1006.wikimedia.org with reason: host reimage [17:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:57] <_joe_> (it's 7 pm and I've been around 12 hours already) [17:05:59] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:06:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:06:21] (03CR) 10Majavah: "Turns out this was just an issue with the old .eqiad.wmflabs aliases, the master and shadow are already covered by the 'sgegrid' rule." [puppet] - 10https://gerrit.wikimedia.org/r/801385 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [17:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:46] it's not urgent. I can try writing the fix. [17:07:35] <_joe_> tgr: ah ofc you use the mediawiki template which is also defined in that file [17:07:47] <_joe_> tgr: frankly it's worth finding a proper fix [17:08:00] <_joe_> and I can try taking a stab tomorrow if I have spare time [17:08:34] Cool, thx. As I said, it's not time-sensitive at all. 
[17:08:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1006.wikimedia.org with reason: host reimage [17:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:27] I can implement logging on the MediaWiki side, but the opinion in the task was to go in the other direction. [17:10:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P29241 and previous config saved to /var/cache/conftool/dbconfig/20220531-171036-ladsgroup.json [17:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:44] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking) Per IRC conversation with RobH, we will work together to reimage one cloudelastic host at a time (6 in the cluster). We'll try to keep 5 hosts in the clu... 
[17:10:48] <_joe_> yeah I would generally agree [17:12:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:14:52] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29242 and previous config saved to /var/cache/conftool/dbconfig/20220531-171619-ladsgroup.json [17:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:56] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.0 - https://phabricator.wikimedia.org/T309116 (10thcipriani) [17:19:12] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:44] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801779 [17:19:46] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801779 (owner: 10Jeena Huneidi) [17:21:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801779 (owner: 10Jeena Huneidi) [17:22:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:22:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
[17:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:40] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.14 refs T308067 [17:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:46] T308067: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 [17:23:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T60674)', diff saved to https://phabricator.wikimedia.org/P29243 and previous config saved to /var/cache/conftool/dbconfig/20220531-172541-ladsgroup.json [17:25:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:25:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:49] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:30] 10SRE, 10ops-codfw, 10DBA: db2088 crashed - https://phabricator.wikimedia.org/T309485 (10Papaul) a:05Papaul→03Marostegui I removed the power for 10 minutes, and the server came back up. The IDRAC log is not showing any HW issues. I upgraded the BIOS and IDRAC on the node. The server is back up. 
[17:28:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:28:40] PROBLEM - Hadoop Namenode - Stand By on an-master1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:29:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:29:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:17] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1006.wikimedia.org with OS bullseye [17:29:22] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye completed: - cloudela... 
[17:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:38] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event.service,refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29244 and previous config saved to /var/cache/conftool/dbconfig/20220531-173124-ladsgroup.json [17:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:30] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (10Papaul) [17:38:30] PROBLEM - Hadoop JournalNode on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:38:34] PROBLEM - Hadoop DataNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:46] PROBLEM - Hadoop DataNode on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:46] PROBLEM - Hadoop DataNode on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:46] PROBLEM - Hadoop JournalNode on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:47] This is known ^ outage [17:38:52] PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-hdfs-journalnode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:52] PROBLEM - Hadoop JournalNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:56] PROBLEM - Hadoop DataNode on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:39:12] PROBLEM - Hadoop JournalNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:39:26] (03CR) 10Aaron Schulz: [C: 03+1] Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz) [17:39:32] PROBLEM - Hadoop DataNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:39:34] working on a downtime now [17:39:46] PROBLEM - Check systemd state on an-worker1090 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-hdfs-journalnode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:50] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-hdfs-journalnode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:50] PROBLEM - Check systemd state on analytics1072 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-hdfs-journalnode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:03] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on analytics1069.eqiad.wmnet with reason: Hadoop incident btullis [17:40:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on analytics1069.eqiad.wmnet with reason: Hadoop incident btullis [17:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:10] PROBLEM - Hadoop JournalNode on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:16] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on analytics1072.eqiad.wmnet with reason: Hadoop incident btullis [17:40:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on analytics1072.eqiad.wmnet with reason: Hadoop incident btullis [17:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:29] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1078.eqiad.wmnet with reason: Hadoop incident btullis [17:40:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1078.eqiad.wmnet with reason: Hadoop incident btullis [17:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:35] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1080.eqiad.wmnet with reason: Hadoop incident btullis [17:40:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1080.eqiad.wmnet with reason: Hadoop incident btullis [17:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:42] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1090.eqiad.wmnet with reason: Hadoop incident btullis [17:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1090.eqiad.wmnet with reason: Hadoop incident btullis [17:40:46] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:56] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [17:43:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (10Papaul) [17:44:04] RECOVERY - Disk space on an-worker1080 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1080&var-datasource=eqiad+prometheus/ops [17:44:22] RECOVERY - Check systemd state on analytics1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:44] RECOVERY - Hadoop JournalNode on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:44:48] RECOVERY - Disk space on analytics1072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [17:45:01] 10SRE, 10Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (10MZMcBride) 05Open→03Resolved a:03MZMcBride 
>>! In T308995#7951881, @Marostegui wrote: > @valhallasw if you can update https://www.mediawiki.org/wiki/Wikibugs t... [17:45:09] 10SRE, 10Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (10MZMcBride) a:05MZMcBride→03valhallasw [17:45:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host backup2009.mgmt.codfw.wmnet with reboot policy FORCED [17:45:13] (03PS2) 10Majavah: sonofgridengine: grid_configurator: remove hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) [17:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:22] RECOVERY - Hadoop JournalNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:26] RECOVERY - Hadoop DataNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:38] RECOVERY - Hadoop DataNode on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:40] RECOVERY - Hadoop DataNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:40] RECOVERY - Hadoop JournalNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:44] RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:44] RECOVERY - Hadoop JournalNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:48] RECOVERY - Hadoop DataNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:46:00] 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) [17:46:02] RECOVERY - Hadoop JournalNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:46:22] RECOVERY - Hadoop DataNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:46:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29245 and previous config saved to /var/cache/conftool/dbconfig/20220531-174629-ladsgroup.json [17:46:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:46:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:46:34] RECOVERY - Disk space on analytics1069 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1069&var-datasource=eqiad+prometheus/ops [17:46:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:36] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:46:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:46:38] RECOVERY - Check systemd state on an-worker1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:42] RECOVERY - Check systemd state on analytics1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T309311)', diff saved to https://phabricator.wikimedia.org/P29246 and previous config saved to /var/cache/conftool/dbconfig/20220531-174642-ladsgroup.json [17:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:01] 10SRE, 10Infrastructure-Foundations, 10netops: codfw: Provision a server script can not run without a cable ID" - 
https://phabricator.wikimedia.org/T308768 (10Papaul) 05Open→03Resolved I tested this on backup2009 all is working with no issues. Thanks [17:47:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [17:47:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [17:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T60674)', diff saved to https://phabricator.wikimedia.org/P29247 and previous config saved to /var/cache/conftool/dbconfig/20220531-174753-ladsgroup.json [17:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:00] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:02] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:48:14] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:48:25] (03CR) 10SBassett: "I'm trying to think if there's any additional vector that gets introduced via form-action allow-lists that is different from someone accid" [puppet] - 10https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [17:48:48] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [17:51:36] jouncebot: nowandnext [17:51:36] 
No deployments scheduled for the next 0 hour(s) and 8 minute(s) [17:51:36] In 0 hour(s) and 8 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1800) [17:51:48] (03CR) 10RLazarus: "Heh, I was just about to say -- I can take care of merging this in the puppet repo when the time comes, but I can't review the semantics f" [puppet] - 10https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [17:54:05] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.14 refs T308067 (duration: 31m 24s) [17:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:12] T308067: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 [17:54:58] (03PS1) 10Majavah: wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 [17:55:00] (03PS1) 10Majavah: wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 (https://phabricator.wikimedia.org/T309525) [17:55:23] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:47] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.39.0-wmf.12 (duration: 01m 29s) [17:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:56] (03CR) 10CI reject: [V: 04-1] wmcs: vps: remove_instance: add support for puppet deactivation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [17:58:59] (03PS19) 10Brennen Bearnes: gitlab runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [17:59:25] (03CR) 10CI reject: [V: 04-1] wmcs: toolforge: add a cookbook to remove a grid node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801785 
(https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [18:00:05] jeena and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1800). [18:00:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:14] RECOVERY - Disk space on an-worker1078 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1078&var-datasource=eqiad+prometheus/ops [18:01:21] (03PS1) 10Jeena Huneidi: group0 wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801787 [18:01:23] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801787 (owner: 10Jeena Huneidi) [18:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T309311)', diff saved to https://phabricator.wikimedia.org/P29248 and previous config saved to /var/cache/conftool/dbconfig/20220531-180123-ladsgroup.json [18:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:29] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:02:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:16] RECOVERY - Disk space on an-worker1090 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1090&var-datasource=eqiad+prometheus/ops [18:02:48] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/801787 (owner: 10Jeena Huneidi) [18:05:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801723 (https://phabricator.wikimedia.org/T244792) (owner: 10Filippo Giunchedi) [18:05:26] (03PS4) 10Tchanders: Add QuickSurveys survey for the SimilarEditors feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) [18:05:46] (03PS2) 10Muehlenhoff: Add DHCP record for idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/801717 [18:06:20] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.14 refs T308067 [18:06:22] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:27] T308067: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 [18:06:35] 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309595 (10MoritzMuehlenhoff) p:05Triage→03Medium [18:06:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:06:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:22] (03CR) 10JHathaway: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/801723 (https://phabricator.wikimedia.org/T244792) (owner: 10Filippo Giunchedi) [18:07:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2009.mgmt.codfw.wmnet with reboot policy FORCED [18:07:33] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP record for idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/801717 (owner: 10Muehlenhoff) [18:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:46] (03PS1) 10Bking: Revert "Upgrade to elasticsearch 7.10.2" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801789 [18:08:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:11:46] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:12:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:13:02] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T60674)', diff saved to https://phabricator.wikimedia.org/P29249 and previous config saved to /var/cache/conftool/dbconfig/20220531-181354-ladsgroup.json [18:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:01] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:14:03] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frbackup2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T309643 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] member "ge-[0-1]/0/13" { ... } + member "ge-[0-1]/0/6"; [edit interfaces int... 
[18:14:12] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:38] RECOVERY - Hadoop Namenode - Stand By on an-master1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:15:15] (03PS1) 10Ladsgroup: Allow sharding in site_stats update [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801751 (https://phabricator.wikimedia.org/T306589) [18:15:42] (03PS1) 10DCausse: [cirrus] Fix typo in config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 [18:15:44] (03PS1) 10DCausse: [cirrus] Add a custom profile for Special:NewLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) [18:15:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (10Papaul) [18:16:01] (03CR) 10CI reject: [V: 04-1] [cirrus] Add a custom profile for Special:NewLexeme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [18:16:09] jouncebot: nowandnext [18:16:09] For the next 1 hour(s) and 43 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T1800) [18:16:09] In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T2000) [18:16:16] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:16:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29250 and previous config saved to /var/cache/conftool/dbconfig/20220531-181628-ladsgroup.json [18:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host idp1002.wikimedia.org [18:16:40] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [18:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:35] (03PS2) 10Bking: Revert "Upgrade to elasticsearch 7.10.2" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801789 (https://phabricator.wikimedia.org/T309648) [18:19:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:19:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:22] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:20:38] !log jmm@cumin2002 
START - Cookbook sre.dns.wipe-cache idp1002.wikimedia.org on all recursors [18:20:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp1002.wikimedia.org on all recursors [18:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:15] (03PS3) 10Bking: Revert "Upgrade to elasticsearch 7.10.2" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801789 (https://phabricator.wikimedia.org/T309648) [18:25:51] (03CR) 10Jdlrobson: [C: 03+1] Follow-up I1dee51009: Add url() to list-style-image [skins/Vector] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801193 (https://phabricator.wikimedia.org/T309374) (owner: 10Jforrester) [18:27:47] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frbackup2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T309643 (10Papaul) [18:28:11] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frbackup2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T309643 (10Papaul) 05Open→03Resolved complete [18:28:14] (03PS4) 10Bking: Revert "Upgrade to elasticsearch 7.10.2" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801789 (https://phabricator.wikimedia.org/T309648) [18:28:18] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10MoritzMuehlenhoff) [18:29:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P29251 and previous config saved to /var/cache/conftool/dbconfig/20220531-182859-ladsgroup.json [18:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:16] (03PS1) 10Bking: Elastic: Add S3 plugin [software/elasticsearch/plugins] - 
10https://gerrit.wikimedia.org/r/801795 (https://phabricator.wikimedia.org/T309648) [18:31:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29252 and previous config saved to /var/cache/conftool/dbconfig/20220531-183133-ladsgroup.json [18:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:36] (03CR) 10Dzahn: "ACK, though.. generally I am not sure how to get those reviews." [puppet] - 10https://gerrit.wikimedia.org/r/791678 (owner: 10Dzahn) [18:33:37] (03CR) 10Dzahn: [C: 03+2] etherpad: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801634 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:33:46] (03PS2) 10Dzahn: etherpad: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801634 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:35:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:03] (03PS2) 10Muehlenhoff: auditd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801633 (https://phabricator.wikimedia.org/T308013) [18:39:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp1002.wikimedia.org [18:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:42] jeena: Hey hey, can I backport a change to wmf.14 now? 
not a blocker, just want to make sure it catches the train [18:40:10] (03CR) 10Muehlenhoff: [C: 03+2] auditd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801633 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:40:18] (03PS5) 10Bking: Revert "Upgrade to elasticsearch 7.10.2" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801789 (https://phabricator.wikimedia.org/T309648) [18:40:33] Amir1: yeah, you can do a backport but I've already deployed to group0 so we'll still have to sync [18:40:41] yeah, that's fine [18:40:49] 👍 [18:40:52] (03CR) 10Ladsgroup: [C: 03+2] Allow sharding in site_stats update [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801751 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [18:41:24] Thanks! [18:41:42] (03PS2) 10Muehlenhoff: arclamp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801632 (https://phabricator.wikimedia.org/T308013) [18:41:55] (03CR) 10Ebernhardson: [C: 03+2] Revert "Upgrade to elasticsearch 7.10.2" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801789 (https://phabricator.wikimedia.org/T309648) (owner: 10Bking) [18:42:00] (03PS2) 10Ebernhardson: Elastic: Add S3 plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801795 (https://phabricator.wikimedia.org/T309648) (owner: 10Bking) [18:42:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:42:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:13] (03CR) 10Ebernhardson: [C: 03+2] Elastic: Add S3 plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801795 (https://phabricator.wikimedia.org/T309648) (owner: 10Bking) [18:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:20] (03CR) 10Muehlenhoff: [C: 
03+2] arclamp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801632 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:43:56] (03PS1) 10Ladsgroup: beta: Enable multi-shard site_stats in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801796 (https://phabricator.wikimedia.org/T306589) [18:44:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P29253 and previous config saved to /var/cache/conftool/dbconfig/20220531-184404-ladsgroup.json [18:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:20] (03CR) 10Ladsgroup: [C: 03+2] beta: Enable multi-shard site_stats in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801796 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [18:46:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T309311)', diff saved to https://phabricator.wikimedia.org/P29254 and previous config saved to /var/cache/conftool/dbconfig/20220531-184638-ladsgroup.json [18:46:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:46:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:45] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:08] (03Merged) 10jenkins-bot: beta: 
Enable multi-shard site_stats in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801796 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [18:48:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:32] (03PS1) 10Papaul: Add backup2009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/801797 (https://phabricator.wikimedia.org/T307049) [18:48:35] (03PS1) 10Muehlenhoff: Add idp1002 to DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/801798 [18:48:45] (03CR) 10Dzahn: [C: 03+2] clamav: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801638 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:48:51] (03PS2) 10Dzahn: clamav: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801638 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:49:30] (03PS1) 10JHathaway: mx: enable tainted data checking [puppet] - 10https://gerrit.wikimedia.org/r/801799 (https://phabricator.wikimedia.org/T286911) [18:50:36] (03CR) 10Papaul: [C: 03+2] Add backup2009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/801797 (https://phabricator.wikimedia.org/T307049) (owner: 10Papaul) [18:52:00] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:04] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/801799 (https://phabricator.wikimedia.org/T286911) (owner: 10JHathaway) [18:53:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:34] (03CR) 10AGueyte: [C: 03+1] "Good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 
(https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [18:55:12] (03CR) 10Muehlenhoff: [C: 03+2] Add idp1002 to DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/801798 (owner: 10Muehlenhoff) [18:58:13] (03PS1) 10Ladsgroup: Enable MultiShardSiteStats in several large wikis and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801801 (https://phabricator.wikimedia.org/T306589) [18:58:31] 10SRE, 10Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T309651 (10ssingh) [18:58:43] (03CR) 10Dzahn: [C: 03+2] "this made me realize we ONLY use this on OTRS. I thought it was also on mx servers in general. maybe that was in the past." [puppet] - 10https://gerrit.wikimedia.org/r/801638 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:59:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T60674)', diff saved to https://phabricator.wikimedia.org/P29255 and previous config saved to /var/cache/conftool/dbconfig/20220531-185909-ladsgroup.json [18:59:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [18:59:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [18:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:15] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:59:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T60674)', diff saved to https://phabricator.wikimedia.org/P29256 and previous config saved to /var/cache/conftool/dbconfig/20220531-185917-ladsgroup.json [18:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:59:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:57] (03Merged) 10jenkins-bot: Allow sharding in site_stats update [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/801751 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [18:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:01] 10SRE, 10Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T309651 (10ssingh) ` trafficserver (9.1.2-1wm1) buster-wikimedia; urgency=medium * Non-maintainer upload. * New upstream release 9.1.2 -- Sukhbir Singh Tue, 31 May 2022 13:34:20 -0400 ` [19:00:09] 10SRE, 10Traffic: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 (10ssingh) [19:03:06] (03CR) 10Ladsgroup: [C: 03+2] Enable MultiShardSiteStats in several large wikis and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801801 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [19:03:53] (03Merged) 10jenkins-bot: Enable MultiShardSiteStats in several large wikis and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801801 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [19:06:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:42] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:07:47] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:801801|Enable MultiShardSiteStats in several large wikis and 
testwiki (T306589)]] (duration: 03m 09s) [19:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:54] T306589: Add sharding to site_stats table - https://phabricator.wikimedia.org/T306589 [19:10:37] (03PS1) 10Ladsgroup: Allow sharding in site_stats update [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801752 (https://phabricator.wikimedia.org/T306589) [19:11:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:55] (03CR) 10Ladsgroup: [C: 03+2] Allow sharding in site_stats update [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801752 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [19:12:58] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.14/includes/: Backport: [[gerrit:801751|Allow sharding in site_stats update (T306589)]] (duration: 03m 25s) [19:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:06] T306589: Add sharding to site_stats table - https://phabricator.wikimedia.org/T306589 [19:14:01] (03CR) 10Herron: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801714 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [19:14:38] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:15:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:04] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:46] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:20:00] (03PS1) 10Ryan Kemper: Fix wrong BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801802 (https://phabricator.wikimedia.org/T309648) [19:21:07] (03CR) 10Ryan Kemper: [C: 03+2] Fix wrong BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801802 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:21:21] (03CR) 10Bking: [C: 03+1] Fix wrong BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/801802 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:21:30] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:54] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T60674)', diff saved to https://phabricator.wikimedia.org/P29257 and previous config saved to /var/cache/conftool/dbconfig/20220531-192535-ladsgroup.json [19:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:45] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:26:17] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [19:26:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [19:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29258 and previous config saved to /var/cache/conftool/dbconfig/20220531-192623-ladsgroup.json [19:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:34] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [19:27:34] (03Merged) 10jenkins-bot: Allow sharding in site_stats update [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801752 (https://phabricator.wikimedia.org/T306589) (owner: 10Ladsgroup) [19:29:54] this will cause a bit of error spike ^ [19:29:58] but not too large [19:33:04] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.13/includes/: Backport: [[gerrit:801752|Allow sharding in site_stats update (T306589)]] (duration: 03m 20s) [19:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] T306589: Add sharding to site_stats table - https://phabricator.wikimedia.org/T306589 [19:36:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:28] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:40:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] 
START helmfile.d/services/mwdebug: apply [19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P29259 and previous config saved to /var/cache/conftool/dbconfig/20220531-194040-ladsgroup.json [19:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:16] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:54] (03CR) 10Eevans: [C: 03+2] "LGTM" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/801728 (owner: 10Hnowlan) [19:51:05] (03CR) 10Eevans: [V: 03+2 C: 03+2] Add missing parameter to CalledProcessError [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/801728 (owner: 10Hnowlan) [19:51:08] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:26] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:55:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P29261 and previous config saved to 
/var/cache/conftool/dbconfig/20220531-195546-ladsgroup.json [19:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220531T2000). [20:00:05] AnaisGueyte: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:46] hey AnaisGueyte, around? [20:02:05] She is but she's having some issues w/her nick. She's relogging now [20:02:56] Tran: okay okay. [20:03:09] I can deploy today :) [20:03:18] let me review the patches in the meanwhile [20:04:19] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [20:04:48] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:04:57] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:05:13] (03PS5) 10Urbanecm: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [20:05:15] Guest2825 is AnaisGueyte [20:05:23] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [20:05:28] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:05:39] Hi! 
I'm having issues identifying as AnaisGueyte [20:05:51] But I'm here and following the deployments steps [20:06:06] hi Guest2825! If you can describe the issues more accurately, perhaps i can help? [20:06:18] (but guest is also fine, just offering) [20:06:33] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [20:06:51] nickserv doesn't update my name even when identifying and confirming I'm identified as AnaisGueyte [20:07:04] Guest2825: try /msg NickServ REGAIN [20:07:54] Thank you! [20:07:57] 🎉 [20:07:57] any time :) [20:08:10] so, let's get started i guess? [20:08:30] Yes! Thank you [20:08:35] I've checked the patches, and they all look good to me. They're also all beta-only, so it'll be mostly a bunch of syncs this time :)) [20:08:38] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T309648 [20:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:46] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [20:08:49] Absolutely, thanks [20:09:27] (03CR) 10Urbanecm: [C: 03+2] Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [20:09:31] (03PS6) 10Urbanecm: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [20:09:36] (03CR) 10Urbanecm: [C: 03+2] Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [20:09:55] (03PS4) 10Urbanecm: Add SimilarEditors extension – II: Add to InitialiseSettings, 
default off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:09:58] (03CR) 10Urbanecm: [C: 03+2] Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:10:05] (03PS4) 10Urbanecm: Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:10:08] (03CR) 10Urbanecm: [C: 03+2] Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:10:27] (03Merged) 10jenkins-bot: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [20:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T60674)', diff saved to https://phabricator.wikimedia.org/P29262 and previous config saved to /var/cache/conftool/dbconfig/20220531-201051-ladsgroup.json [20:10:52] (03Merged) 10jenkins-bot: Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:10:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [20:10:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [20:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:58] (03Merged) 10jenkins-bot: Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [20:10:59] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T60674)', diff saved to https://phabricator.wikimedia.org/P29263 and previous config saved to /var/cache/conftool/dbconfig/20220531-201059-ladsgroup.json [20:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:48] !log urbanecm@deploy1002 Synchronized wmf-config/extension-list: b88690ee8ccbc50a51c6ef9dcdcbe3faecc3170f: Add SimilarEditors extension – I: Add to i18n (T306909) (duration: 03m 07s) [20:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:53] T306909: Prerequisites to deploying SimilarEditors to the beta cluster - https://phabricator.wikimedia.org/T306909 [20:16:29] (03CR) 10Legoktm: [C: 04-1] cgroup: Add different package for Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: 10Samtar) [20:17:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:31] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T309648 [20:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:36] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [20:17:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [20:17:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:20] (03CR) 10Urbanecm: [C: 04-1] "-1. Unmet dependency." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [20:19:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 97551748271963ea59bd4f28b2fb1a1b48dd0c29: Add SimilarEditors extension – II: Add to InitialiseSettings, default off (T306909) (duration: 03m 06s) [20:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:03] T306909: Prerequisites to deploying SimilarEditors to the beta cluster - https://phabricator.wikimedia.org/T306909 [20:20:49] AnaisGueyte: Tran: unfortunately, I won't be able to deploy the final patch today as-is. the patch tries to enable SimilarEditors at all of beta cluster, but not all wikis of beta cluster have QuickSurveys (which is marked as a hard dependency by SimilarEditors). can one of you please have a look at it? 
[20:20:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29264 and previous config saved to /var/cache/conftool/dbconfig/20220531-202058-ladsgroup.json [20:21:03] (hopefully my comment makes sense there, let me know if something's not clear) [20:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:06] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [20:22:38] Thanks @urbanecm, havin a look rn [20:22:56] thanks. let me know if i can help :) [20:23:06] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 660afb14144db36d94deccae59d4ead096e10e5f: Add SimilarEditors extension – III: Add to CommonSettings (T306909) (duration: 03m 09s) [20:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:13] (03PS2) 10Samtar: cgroup: Add different package for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) [20:23:50] (03CR) 10Samtar: cgroup: Add different package for Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: 10Samtar) [20:24:10] Hm...so would the solution be to enable QuickSurveys on all beta wikis? Is that okay? I see that it's only on en beta atm [20:25:16] you can also only enable the extension on wikis where the dependency is enabled by if ( $wmgUseSimilarUsers && $wmgUseQuickSurveys ) { wfLoadExtension( ... ) } [20:25:46] that seems a bit much, considering we only currently expect this extension to be viable on enwiki atm. 
I think I agree with @legoktm but I guess we won't be able to finish deploying today [20:25:48] I would recommend that type of conditional regardless of whether you expand the wikis where QuickSurveys [20:25:55] ...is deployed [20:26:44] Tran: if the gain is to have it only at beta enwiki, perhaps we can change `'default' => true` with `'enwiki' => true` in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799012? [20:27:02] (that'd be fine with me) [20:27:21] s/gain/goal/ [20:27:22] that works for me [20:27:30] 'default' => true [20:27:35] AnaisGueyte can you update the patch? [20:28:08] (03PS3) 10Legoktm: mediawiki: Use non-transitional cgroups package for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: 10Samtar) [20:28:36] Tran: note that the fact QuickSurveys is a hard dependency will likely be a problem for wide production deployment. AFAIK Performance doesn't really want to have it deployed everywhere, because it slows us down [20:29:04] (03PS6) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) [20:29:23] (03CR) 10Legoktm: [C: 03+1] "I edited the commit message a bit, LGTM, should be a functional no-op on all buster systems!" [puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: 10Samtar) [20:29:56] (yay for beta cluster identifying potential production issues!!) [20:30:01] @urbanecm That makes sense. It'd be great to get this onto beta for now so we can start testing but I will bring this up as a prod blocker w/the rest of the team. [20:30:02] I do love the word "should" [20:30:28] followed closely by "it shouldn't be doing *that* why is it doing *that*..." [20:30:42] the patch has been updated [20:30:47] thank you! [20:30:56] AnaisGueyte: Tran: I'm a bit confused. 
i thought we agreed on enabling the extension on enwiki beta only for now, but the patch instead enables QuickSurvey everywhere? [20:32:49] (03PS7) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) [20:33:08] My bad, it's re-updated now [20:33:12] thank you! [20:33:13] TheresNoTime: bug me again in a few days if no one has merged that. btw, you should be able to use https://debmonitor.wikimedia.org/ to check that all MW hosts already have cgroup-tools installed [20:33:57] I can, not that I have any idea what this is :') [20:34:17] (03CR) 10Urbanecm: Deploy SimilarEditors to the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [20:34:21] (03PS8) 10Urbanecm: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [20:34:47] (03CR) 10Urbanecm: [C: 03+2] "Should work fine now :-)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [20:35:11] it aggregates all the packages and their versions installed on every host (and some docker images) [20:35:39] Smart :) https://debmonitor.wikimedia.org/packages/cgroup-bin suggests there's a "few" which aren't using "-tools" [20:36:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29265 and previous config saved to /var/cache/conftool/dbconfig/20220531-203603-ladsgroup.json [20:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:26] hm? 
all hosts with -bin should have -tools too [20:36:29] (03Merged) 10jenkins-bot: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [20:37:43] ah, yes, didn't check that :) [20:37:44] AnaisGueyte: so, in theory, the extension should be available on beta enwiki soon (within 30 minutes or less). [20:37:55] Thank you! [20:38:10] looking at the last two patches now :) [20:38:27] (03PS4) 10Urbanecm: Assign similareditors right to the checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [20:38:35] (03CR) 10Urbanecm: [C: 03+2] Assign similareditors right to the checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [20:39:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:30] (03Merged) 10jenkins-bot: Assign similareditors right to the checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [20:40:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:40:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:23] (03PS5) 10Urbanecm: Add QuickSurveys survey for the SimilarEditors feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [20:40:27] (03CR) 10Urbanecm: [C: 03+2] Add QuickSurveys survey for the SimilarEditors feature [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [20:40:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T60674)', diff saved to https://phabricator.wikimedia.org/P29266 and previous config saved to /var/cache/conftool/dbconfig/20220531-204054-ladsgroup.json [20:40:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:02] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:17] * urbanecm is wondering what the large number means in scap's `Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 315 host(s)` [20:41:27] (03Merged) 10jenkins-bot: Add QuickSurveys survey for the SimilarEditors feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [20:43:19] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 51f314f49e2df6ec7784b5200535bb45d8bec154: Assign similareditors right to the checkuser group (T307205) (duration: 03m 06s) [20:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:26] T307205: Assign similareditors right to the checkuser group - https://phabricator.wikimedia.org/T307205 [20:43:39] AnaisGueyte: Tran: ok, so we should be all set now :). anything else? [20:43:49] Thanks! [20:43:51] (03PS1) 10Urbanecm: Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) [20:43:57] That should be it, thanks! [20:44:09] happy to help! [20:45:03] legoktm: hi, mind quickly reviewing 801811? I'd like to push it out too, just to be safe(r). 
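A side note on the number in the scap output quoted at 20:41:17: 9223372036854775807 is the largest value a signed 64-bit integer can hold (2^63 - 1), which is also the value of PHP_INT_MAX on 64-bit builds. The log does not say why scap passes it to check-and-restart-php, so the check below only confirms the arithmetic. Reading it as a "largest possible value" sentinel is an assumption, and the names `SCAP_ARGUMENT` and `is_int64_max` are hypothetical.

```python
# The argument seen in the scap output, copied from the log.
SCAP_ARGUMENT = 9223372036854775807

# Largest signed 64-bit integer: 2**63 - 1.
INT64_MAX = 2**63 - 1


def is_int64_max(value: int) -> bool:
    """Return True when value equals the signed 64-bit maximum."""
    return value == INT64_MAX


if __name__ == "__main__":
    print(is_int64_max(SCAP_ARGUMENT))
```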
[20:45:22] (03CR) 10STran: [C: 03+1] Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) (owner: 10Urbanecm) [20:45:30] or Tran, thanks :) [20:46:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:11] (03CR) 10Urbanecm: [C: 03+2] Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) (owner: 10Urbanecm) [20:46:55] (03PS2) 10Urbanecm: Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) [20:47:00] (03CR) 10Urbanecm: [C: 03+2] Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) (owner: 10Urbanecm) [20:47:07] (03PS3) 10Urbanecm: Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) [20:47:14] (03CR) 10Legoktm: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) (owner: 10Urbanecm) [20:47:22] (03CR) 10Urbanecm: [C: 03+2] Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) (owner: 10Urbanecm) [20:48:25] (03Merged) 10jenkins-bot: Do not load SimilarEditors if QuickSurveys is not installed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801811 (https://phabricator.wikimedia.org/T306909) (owner: 10Urbanecm) [20:51:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29267 and previous config saved to /var/cache/conftool/dbconfig/20220531-205108-ladsgroup.json [20:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:08] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: eaab90234cc7fac8d2cf459c1e959bdbe186e4fa: Do not load SimilarEditors if QuickSurveys is not installed (T306909) (duration: 03m 12s) [20:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:14] T306909: Prerequisites to deploying SimilarEditors to the beta cluster - https://phabricator.wikimedia.org/T306909 [20:52:24] !log UTC late backport window completed [20:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:52:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:38] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:54:16] Hi @urbanecm, there's an Uncaught Exception when accessing en beta wiki. about ExtensionRegistry. Is there a bug we have missed? [20:54:27] AnaisGueyte: hey, let me see! [20:54:48] Unable to open file /srv/mediawiki/php-master/extensions/SimilarEditors/extension.json? that...sounds like the extension code is not there [20:54:53] checking... [20:56:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P29268 and previous config saved to /var/cache/conftool/dbconfig/20220531-205559-ladsgroup.json [20:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:37] something's wrong on beta with submodules. let me fix that manually. [20:57:43] Thank you [20:59:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:36] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:01:46] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f4ff6488280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:01:58] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@relforge-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:23] AnaisGueyte: ok, so it actually is not a beta issue, rather, it's a weird issue with this patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/+/789206. let me upload a fix of it. 
[21:04:02] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 274, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:04:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:14] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson cloudvirt1051 e4 u28 Cableid 20220056 ; 34 port : Cableid. 20220055 ; 35 por... 
[21:05:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Jclark-ctr) [21:06:04] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 274, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:06:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29269 and previous config saved to /var/cache/conftool/dbconfig/20220531-210613-ladsgroup.json [21:06:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:06:20] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:06:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [21:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [21:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:36] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:31] AnaisGueyte: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/+/801814/ should fix the issue. not sure if you feel comfortable reviewing it though :). if you do, that'd be great. alternatively, we can also leave that to someone else (and until the patch gets merged, revert the beta deployment patch to bring enwiki beta back). what do you prefer? [21:08:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:08:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:28] Thanks for looking into this, what's the impact of this change? Is this affecting a previous patch? [21:11:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P29270 and previous config saved to /var/cache/conftool/dbconfig/20220531-211105-ladsgroup.json [21:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:23] AnaisGueyte: so, the previous patch (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/+/789206) did not add a submodule in a way recognized by git. instead, it treated the meant-to-be submodule as a regular file rather than a submodule. 
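The submodule-versus-regular-file distinction described at 21:12:23 can be checked from a checkout. In a git tree, a submodule is recorded as a "gitlink" entry with mode 160000 and type `commit`, whereas a regular file has mode 100644 (or 100755) and type `blob`. The sketch below parses one line of `git ls-tree` output. The helper names and the sample object IDs are made up for illustration and are not taken from the patches discussed here.

```python
from typing import NamedTuple

GITLINK_MODE = "160000"  # mode git records for a submodule entry


class TreeEntry(NamedTuple):
    mode: str
    type: str
    object_id: str
    path: str


def parse_ls_tree_line(line: str) -> TreeEntry:
    """Parse one '<mode> <type> <object>\\t<path>' line from `git ls-tree`."""
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, obj_type, object_id = meta.split(" ")
    return TreeEntry(mode, obj_type, object_id, path)


def is_submodule(entry: TreeEntry) -> bool:
    """A submodule shows up as a gitlink: mode 160000, type 'commit'."""
    return entry.mode == GITLINK_MODE and entry.type == "commit"


# Hypothetical sample lines; the object IDs are placeholders.
AS_SUBMODULE = "160000 commit " + "a" * 40 + "\tSimilarEditors"
AS_PLAIN_FILE = "100644 blob " + "b" * 40 + "\tSimilarEditors"

if __name__ == "__main__":
    for sample in (AS_SUBMODULE, AS_PLAIN_FILE):
        entry = parse_ls_tree_line(sample)
        print(entry.path, entry.mode, is_submodule(entry))
```

On a real clone, the input line could come from running `git ls-tree HEAD SimilarEditors` in the mediawiki/extensions checkout.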
[21:12:25] my patch fixes that [21:12:44] Ouh nice, thanks let's go forward with it [21:12:52] you can test that by cloning mediawiki/extensions from gerrit, applying my patch and running git `submodule update --init SimilarEditors` [21:13:19] it should say something like this https://www.irccloud.com/pastebin/rUdL4tx9/ [21:13:35] well, and i see legoktm just merged it (thanks) [21:13:47] once it arrives to beta (~30 minutes), it should unbreak magically. [21:14:00] you can log into jenkins and trigger the beta update manually instead of waiting [21:14:23] i'd still need to wait for scap sync-world though, wouldn't i? [21:14:30] https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/394069/console ran on the timer, well timed! [21:14:50] urbanecm: production deploy done? [21:14:56] Krinkle: yes. [21:15:16] urbanecm: trigger that job manually too :p [21:15:24] i meant for its completion :)) [21:15:40] I think so [21:15:49] I'm a bit foggy on how it all works these days [21:16:04] its running already (https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/53499/console) [21:16:09] yup yup [21:16:26] Thank you! [21:17:17] (03CR) 10Krinkle: [C: 03+2] Follow-up I1dee51009: Add url() to list-style-image [skins/Vector] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801193 (https://phabricator.wikimedia.org/T309374) (owner: 10Jforrester) [21:17:23] (03CR) 10Krinkle: [C: 03+2] Follow-up I8d62aedb: Fix .rotation mixin [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800696 (owner: 10Krinkle) [21:18:30] you'd think https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page 500ing would raise an alert *somewhere* [21:19:53] TheresNoTime: possibly in #wikimedia-releng [21:20:02] it's back up! 🎉 thanks again! [21:20:07] happy to help! [21:20:24] beta would greatly benefit from T215217. 
clear place for alerts's one of the benefits :)) [21:20:24] T215217: deployment-prep: Code stewardship request - https://phabricator.wikimedia.org/T215217 [21:21:09] Seeing en beta wiki is back up and similar editors shows on Special:Version and Special:SimilarEditors [21:21:15] thank you for your help! [21:21:24] sounds like good news to me :). [21:26:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T60674)', diff saved to https://phabricator.wikimedia.org/P29271 and previous config saved to /var/cache/conftool/dbconfig/20220531-212610-ladsgroup.json [21:26:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [21:26:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [21:26:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:18] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [21:26:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T60674)', diff saved to https://phabricator.wikimedia.org/P29272 and previous config saved to /var/cache/conftool/dbconfig/20220531-212623-ladsgroup.json [21:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:20] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:28:49] (03PS1) 10Papaul: Add bakup2009 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/801815 (https://phabricator.wikimedia.org/T307049) [21:32:16] (03Merged) 10jenkins-bot: Follow-up I1dee51009: Add url() to list-style-image [skins/Vector] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801193 (https://phabricator.wikimedia.org/T309374) (owner: 10Jforrester) [21:32:24] (03CR) 10Papaul: [C: 03+2] Add bakup2009 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/801815 (https://phabricator.wikimedia.org/T307049) (owner: 10Papaul) [21:35:25] (03Merged) 10jenkins-bot: Follow-up I8d62aedb: Fix .rotation mixin [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800696 (owner: 10Krinkle) [21:37:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:59] * Krinkle staging on mwdebug1002 [21:42:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host backup2009.codfw.wmnet with OS bullseye [21:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host backup2009.codfw.wmnet with OS bull... 
[21:42:56] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:43:59] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.13/resources/src/mediawiki.less: I342384c822554 (duration: 03m 12s) [21:44:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:44:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:50] (03CR) 10Krinkle: [C: 03+2] "verified on mwdebug1002 with and wihtout, and deployed." [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800696 (owner: 10Krinkle) [21:44:52] (03CR) 10Krinkle: [C: 03+2] "verified on mwdebug1002 with and wihtout, and deployed." 
[skins/Vector] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/801193 (https://phabricator.wikimedia.org/T309374) (owner: 10Jforrester) [21:45:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:34] (03PS1) 10Bartosz Dziewoński: Launch DiscussionTools topic subscriptions a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801818 (https://phabricator.wikimedia.org/T304029) [21:46:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:46:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29273 and previous config saved to /var/cache/conftool/dbconfig/20220531-214630-ladsgroup.json [21:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:38] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:47:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:47] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.13/skins/Vector/resources/: I91d690700cf (duration: 03m 23s) [21:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:58] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:53:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:46] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:56:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:53] (03PS1) 10Bartosz Dziewoński: Make new topic tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801820 (https://phabricator.wikimedia.org/T309368) [22:07:56] RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T60674)', diff saved to https://phabricator.wikimedia.org/P29274 and previous config saved to /var/cache/conftool/dbconfig/20220531-220930-ladsgroup.json [22:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:41] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [22:11:49] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:14:54] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:19] (03PS1) 10Papaul: Fix typo on backup2009 [puppet] - 10https://gerrit.wikimedia.org/r/801821 (https://phabricator.wikimedia.org/T307049) [22:18:34] (03CR) 10Papaul: [C: 03+2] Fix typo on backup2009 [puppet] - 10https://gerrit.wikimedia.org/r/801821 (https://phabricator.wikimedia.org/T307049) (owner: 10Papaul) [22:21:46] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P29275 and previous config saved to /var/cache/conftool/dbconfig/20220531-222436-ladsgroup.json [22:28:39] 10SRE, 10Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (10Dzahn) https://wikitech.wikimedia.org/wiki/HAProxy#Session_state_at_disconnection [22:28:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10thcipriani) >>! In T309045#7961837, @Dzahn wrote: > @thcipriani Your approval is requested as group approver for "restricted" (just like for 'deployment').... 
[22:36:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2009.codfw.wmnet with reason: host reimage [22:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29276 and previous config saved to /var/cache/conftool/dbconfig/20220531-223818-ladsgroup.json [22:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:24] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [22:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P29277 and previous config saved to /var/cache/conftool/dbconfig/20220531-223941-ladsgroup.json [22:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2009.codfw.wmnet with reason: host reimage [22:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:49] (PS5) Dzahn: admin: Add sgimeno to restricted [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: Alexandros Kosiaris) [22:46:00] (CR) Dzahn: "approved by Tyler, needs rebase" [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: Alexandros Kosiaris) [22:46:22] (CR) Dzahn: [C: +2] admin: Add sgimeno to restricted [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: Alexandros Kosiaris) [22:48:28] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Dzahn) [22:49:11] SRE, 
SRE-Access-Requests, Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Dzahn) p:Triage→High a:thcipriani→Dzahn [22:49:36] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Dzahn) approved by https://meta.wikimedia.org/wiki/User:SCherukuwada_%28WMF%29 in lieu of manager approved by Tyler as group approver deploying https://... [22:51:38] SRE, SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (thcipriani) Sorry for the delay, I'll figure this out in the team meeting tomorrow! [22:53:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29278 and previous config saved to /var/cache/conftool/dbconfig/20220531-225324-ladsgroup.json [22:53:30] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Dzahn) In progress→Resolved Hello @Sgs your user account has been added to the mwmaint* servers (mwmaint1002 in eqiad, mwmaint2002 in codfw). S... [22:53:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2009.codfw.wmnet with OS bullseye [22:53:41] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host backup2009.codfw.wmnet with OS bullseye completed: - backup20... 
[22:54:44] SRE, Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (Dzahn) Open→In progress [22:54:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T60674)', diff saved to https://phabricator.wikimedia.org/P29279 and previous config saved to /var/cache/conftool/dbconfig/20220531-225446-ladsgroup.json [22:54:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [22:54:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [22:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T60674)', diff saved to https://phabricator.wikimedia.org/P29280 and previous config saved to /var/cache/conftool/dbconfig/20220531-225454-ladsgroup.json [22:55:12] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:06] SRE, SRE-OnFire, Release-Engineering-Team, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (Dzahn) @jcrespo You are correct. In that case I still don't understand what this ticket is really asking for, first I thought it was about both depl... 
[23:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29281 and previous config saved to /var/cache/conftool/dbconfig/20220531-230829-ladsgroup.json [23:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:44] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:14:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T60674)', diff saved to https://phabricator.wikimedia.org/P29282 and previous config saved to /var/cache/conftool/dbconfig/20220531-231933-ladsgroup.json [23:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:39] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [23:20:55] !log gitlab2001 - systemctl reset-failed [23:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T309311)', diff saved to https://phabricator.wikimedia.org/P29283 and previous config saved to /var/cache/conftool/dbconfig/20220531-232334-ladsgroup.json [23:23:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [23:23:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [23:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:41] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 
[23:23:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T309311)', diff saved to https://phabricator.wikimedia.org/P29284 and previous config saved to /var/cache/conftool/dbconfig/20220531-232342-ladsgroup.json [23:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:23] (PS1) Dzahn: imagecatalog: do not have auto_restart service if service does not exist [puppet] - https://gerrit.wikimedia.org/r/801829 [23:25:44] (CR) Aaron Schulz: [WIP] Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling) [23:25:54] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:44] ACKNOWLEDGEMENT - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/801829 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:20] SRE, LDAP-Access-Requests: Grant Access to nda for ozhang - https://phabricator.wikimedia.org/T309559 (dr0ptp4kt) Approved, end date for this access 1-October-2022. The 'non-private' access level probably suffices - assuming the topic datasets are accessible there, which is the area of primary interest (t... 
[23:29:21] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35634/ ; https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=deploy2002" [puppet] - https://gerrit.wikimedia.org/r/801829 (owner: Dzahn) [23:31:33] (CR) Krinkle: [WIP] Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling) [23:32:54] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:13] !log deploy2002 systemctl reset-failed after deploying gerrit:801829 fixed alert about broken systemd state [23:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:26] (CR) Dzahn: [V: +1 C: +2] "<+icinga-wm> RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia." [puppet] - https://gerrit.wikimedia.org/r/801829 (owner: Dzahn) [23:34:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P29285 and previous config saved to /var/cache/conftool/dbconfig/20220531-233438-ladsgroup.json [23:34:39] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (aaron) keyname VARBINARY(255) DEFAULT '' NOT NULL, value MEDIUMBLOB DEFAULT NULL, exptime BINARY(14) NOT NU... 
[23:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:24] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle) [23:38:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:40:14] (PS2) Krinkle: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling) [23:41:07] SRE, DBA, MW-1.39-notes (1.39.0-wmf.14; 2022-05-30), Patch-For-Review, and 2 others: App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (Krinkle) [23:42:10] SRE, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 (Krinkle) [23:42:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T309311)', diff saved to https://phabricator.wikimedia.org/P29286 and previous config saved to /var/cache/conftool/dbconfig/20220531-234224-ladsgroup.json [23:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:33] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [23:47:47] (CR) Aaron Schulz: [WIP] Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling) [23:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P29287 and previous config saved to /var/cache/conftool/dbconfig/20220531-234943-ladsgroup.json [23:49:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:38] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29288 and previous config saved to /var/cache/conftool/dbconfig/20220531-235729-ladsgroup.json [23:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:11] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (Papaul)