[00:09:40] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1187.eqiad.wmnet with OS bullseye [00:09:46] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1187.eqiad.wmnet with OS bullseye completed: - db1... [00:12:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Marostegui I re-imaged db1187 it is back online. I can not access db1185 so i asked @Jclark-ctr to check the cable. ` pt1979@cumin1001:~$... [00:19:12] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:26] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10Dzahn) Thank you! I think you could not reproduce the issue because meanwhile the user had been created by user{}. I will test it on phab1004 soon. [00:56:44] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:58:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:00:16] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:41] (03PS1) 10Andrew Bogott: Dumps servers: allow rsyncing to/from new clouddumps hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/825441 (https://phabricator.wikimedia.org/T302981) [01:04:06] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10tstarling) >>! In T315398#8167048, @ori wrote: > AFAICT the sysfs interface (`/sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference`) can only be used to select one of... [01:04:44] (03CR) 10Andrew Bogott: [C: 03+2] Dumps servers: allow rsyncing to/from new clouddumps hosts. [puppet] - 10https://gerrit.wikimedia.org/r/825441 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [01:06:46] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:40] !log on mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set scaling_governor to powersave and energy_performance_preference to performance [01:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:08] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:03] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10tstarling) Something (puppet?) is randomly setting energy_performance_preference back to balance_performance after I set it to performance. 
[01:25:30] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:34] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:01] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10tstarling) Probably not puppet or anything in the userspace. I did a manual puppet run on mw1411, and there was no change. Some time between 01:27 and 01:35, it changed to balance_... 
[01:41:19] !log on mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set energy_performance_preference to balance_performance T315398 [01:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:24] T315398: Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) I am now running the epic rsync from labstore1006 to clouddumps100[12]. Going to take a while! 
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T0200) [02:00:40] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:02:09] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10tstarling) balance_performance is in use in the control group of servers, so it shouldn't be surprising that the metrics are converging. {F35484606} [02:06:45] (JobUnavailable) resolved: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.26 [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825451 [02:07:42] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.26 [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825451 (owner: 10TrainBranchBot) [02:08:02] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:08:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:08:56] 
!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:09:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:16:22] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:18:44] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:24:02] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:36] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:26:50] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.26 [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825451 (owner: 10TrainBranchBot) [02:29:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:30:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:30:52] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:39:36] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:46:40] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 11 processes with UID = 497 (phd) 
https://wikitech.wikimedia.org/wiki/Phabricator [02:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:00:25] !log on wtp1025,wtp1027,wtp1029,wtp1031,wtp1033,wtp1035: set scaling_governor to performance T315398 [03:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:30] T315398: Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 [03:01:48] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:03:31] (03PS8) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) [03:22:22] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:26:52] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:31:50] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:53:08] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:12] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:44] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: 
CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:50] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10tstarling) == Parsoid results == Since there's more going on in the control group, I added graphs of the differences. {F35484662} The hosts1 group was 22ms slower on average tha... [04:42:42] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:04] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:46:40] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:48:15] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) Thank you Papaul! I can access db1187 just fine now! [04:49:02] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:50:51] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) @Papaul it seems that db1186 and db1188 are no longer accessible now. 
[04:52:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [04:52:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [04:52:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T312972)', diff saved to https://phabricator.wikimedia.org/P32763 and previous config saved to /var/cache/conftool/dbconfig/20220823-045227-marostegui.json [04:52:33] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [04:53:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131', diff saved to https://phabricator.wikimedia.org/P32764 and previous config saved to /var/cache/conftool/dbconfig/20220823-045322-root.json [04:53:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312972)', diff saved to https://phabricator.wikimedia.org/P32765 and previous config saved to /var/cache/conftool/dbconfig/20220823-045334-marostegui.json [04:55:15] (03PS1) 10Marostegui: install_server: Do not reimage db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825620 (https://phabricator.wikimedia.org/T315856) [04:56:33] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/825620 (https://phabricator.wikimedia.org/T315856) (owner: 10Marostegui) [04:57:34] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1195 [puppet] - 10https://gerrit.wikimedia.org/r/825620 (https://phabricator.wikimedia.org/T315856) (owner: 10Marostegui) [04:59:23] (03PS1) 10Marostegui: mariadb: Productionize db1187 [puppet] - 10https://gerrit.wikimedia.org/r/825628 (https://phabricator.wikimedia.org/T313569) [05:02:08] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:32] PROBLEM - Check systemd 
state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P32767 and previous config saved to /var/cache/conftool/dbconfig/20220823-050840-marostegui.json [05:23:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P32768 and previous config saved to /var/cache/conftool/dbconfig/20220823-052346-marostegui.json [05:24:36] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:24:42] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10tstarling) == Cost/benefit analysis == From the [[https://upload.wikimedia.org/wikipedia/commons/0/00/Wikimedia_Foundation_Environmental_Sustainability_%28Carbon_Footprint%29_Repo... 
[05:38:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312972)', diff saved to https://phabricator.wikimedia.org/P32769 and previous config saved to /var/cache/conftool/dbconfig/20220823-053852-marostegui.json [05:38:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:38:58] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [05:39:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:39:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [05:39:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [05:39:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T312972)', diff saved to https://phabricator.wikimedia.org/P32770 and previous config saved to /var/cache/conftool/dbconfig/20220823-053929-marostegui.json [05:46:37] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10ayounsi) [05:57:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312972)', diff saved to https://phabricator.wikimedia.org/P32771 and previous config saved to /var/cache/conftool/dbconfig/20220823-055739-marostegui.json [05:57:45] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [05:58:26] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:05] kormat, marostegui, and Amir1: OwO what's this, a 
deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T0600). nyaa~ [06:00:48] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:06:06] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:18] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P32772 and previous config saved to /var/cache/conftool/dbconfig/20220823-061245-marostegui.json [06:15:02] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:07] (03PS2) 10KartikMistry: Update cxserver to 2022-08-22-093815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/825336 (https://phabricator.wikimedia.org/T308248) [06:18:28] Amir1: OK to update cxserver, or are you on db maintenance? [06:19:29] kart_: I don't have anything on x1. Maybe marostegui does [06:19:57] OK. I'll wait for the window to be over then.
[06:20:42] nope [06:20:45] not me [06:21:51] kart_: generally https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance is made for these occasions (we DBAs lost track of who does where, that's why we have this map) [06:26:28] (03PS1) 10Abijeet Patro: Add declarations for TranslatablePage in extension.json [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825284 (https://phabricator.wikimedia.org/T315889) [06:27:39] Amir1: That's useful. Thanks! [06:27:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P32773 and previous config saved to /var/cache/conftool/dbconfig/20220823-062751-marostegui.json [06:32:23] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-08-22-093815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/825336 (https://phabricator.wikimedia.org/T308248) (owner: 10KartikMistry) [06:37:15] (03Merged) 10jenkins-bot: Update cxserver to 2022-08-22-093815-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/825336 (https://phabricator.wikimedia.org/T308248) (owner: 10KartikMistry) [06:38:31] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:39:06] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:41:32] (03PS1) 10Vivian Rook: Allow cloud_provider_enabled [puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) [06:41:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [06:42:02] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [06:42:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [06:42:57] !log marostegui@cumin1001
dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312972)', diff saved to https://phabricator.wikimedia.org/P32774 and previous config saved to /var/cache/conftool/dbconfig/20220823-064257-marostegui.json [06:42:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:43:01] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [06:43:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:43:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T312972)', diff saved to https://phabricator.wikimedia.org/P32775 and previous config saved to /var/cache/conftool/dbconfig/20220823-064318-marostegui.json [06:44:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T312972)', diff saved to https://phabricator.wikimedia.org/P32776 and previous config saved to /var/cache/conftool/dbconfig/20220823-064425-marostegui.json [06:45:09] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:45:47] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:48:23] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:49:21] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:49:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2019.codfw.wmnet with OS bullseye [06:49:55] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bullseye [06:50:31] !log Updated cxserver to 2022-08-22-093815-production (T308248, T308371) [06:50:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:36] T308371: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 [06:50:36] T308248: Newly supported languages in Google Translate - https://phabricator.wikimedia.org/T308248 [06:53:40] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:59:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P32777 and previous config saved to /var/cache/conftool/dbconfig/20220823-065931-marostegui.json [07:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T0700). [07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:03:29] hello [07:06:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2019.codfw.wmnet with reason: host reimage [07:08:48] (03CR) 10JMeybohm: Add an entry for dse-k8s-ctrl to the service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [07:10:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:10:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2019.codfw.wmnet with reason: host reimage [07:10:52] (03PS1) 10Muehlenhoff: Disable Ganeti cluster rebalances temporarily [puppet] - 10https://gerrit.wikimedia.org/r/825678 (https://phabricator.wikimedia.org/T311686) [07:11:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:14:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P32778 and previous config saved to /var/cache/conftool/dbconfig/20220823-071437-marostegui.json [07:16:51] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
The monitoring entry refers to https://wikitech.wikimedia.org/wiki/PERCCli#Monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [07:21:24] I have a couple of patches scheduled for the backport window [07:29:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T312972)', diff saved to https://phabricator.wikimedia.org/P32779 and previous config saved to /var/cache/conftool/dbconfig/20220823-072943-marostegui.json [07:29:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [07:29:49] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [07:29:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2019.codfw.wmnet with OS bullseye [07:29:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [07:30:00] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bullseye completed: - ganeti2019 (**PA... 
[07:30:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:30:11] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:30:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T312972)', diff saved to https://phabricator.wikimedia.org/P32780 and previous config saved to /var/cache/conftool/dbconfig/20220823-073020-marostegui.json [07:31:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312972)', diff saved to https://phabricator.wikimedia.org/P32781 and previous config saved to /var/cache/conftool/dbconfig/20220823-073127-marostegui.json [07:37:21] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:42] (03CR) 10Jbond: [C: 03+2] R:systemd::sysuser: drop managehome parameter as it dosn;t work (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824696 (https://phabricator.wikimedia.org/T315568) (owner: 10Jbond) [07:45:17] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P32783 and previous config saved to /var/cache/conftool/dbconfig/20220823-074633-marostegui.json [07:52:16] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1187 [puppet] - 10https://gerrit.wikimedia.org/r/825628 (https://phabricator.wikimedia.org/T313569) 
(owner: 10Marostegui) [07:55:45] (03PS1) 10Muehlenhoff: Absent libsnmp30 on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/825682 [07:56:20] (03PS1) 10Marostegui: site.pp: Regex for dbs insetup [puppet] - 10https://gerrit.wikimedia.org/r/825683 [07:58:00] (03PS2) 10Marostegui: site.pp: Regex for dbs insetup [puppet] - 10https://gerrit.wikimedia.org/r/825683 [07:58:51] (03CR) 10Marostegui: [C: 03+2] site.pp: Regex for dbs insetup [puppet] - 10https://gerrit.wikimedia.org/r/825683 (owner: 10Marostegui) [08:00:05] hashar and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T0800). [08:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P32784 and previous config saved to /var/cache/conftool/dbconfig/20220823-080139-marostegui.json [08:01:48] (03CR) 10David Caro: [C: 03+2] rabbit.drain_queue: Don't fail if the queue has no messages [puppet] - 10https://gerrit.wikimedia.org/r/814726 (owner: 10David Caro) [08:01:59] (03PS2) 10David Caro: rabbit.drain_queue: Don't fail if the queue has no messages [puppet] - 10https://gerrit.wikimedia.org/r/814726 [08:03:02] o/ [08:03:07] jouncebot: now [08:03:07] For the next 1 hour(s) and 56 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T0800) [08:03:09] jouncebot: prev [08:03:13] ... [08:03:47] abijeet: sorry looks like nobody was around this morning to process patches :( [08:04:40] it happens sometimes unfortunately [08:05:21] hashar, no problem. I've rescheduled it for afternoob. 
[08:05:27] afternoon* [08:05:34] well [08:05:40] we can do it right now if you want [08:05:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825682 (owner: 10Muehlenhoff) [08:05:58] time is blocked right now for the MediaWiki train and I am the one running it [08:06:09] OK, that would be great. I'll move them back [08:06:11] I can delay it since well it is 10am here and I have ample time to run it [08:06:21] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service,swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:30] (03CR) 10Hashar: [C: 03+2] "For backporting" [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825284 (https://phabricator.wikimedia.org/T315889) (owner: 10Abijeet Patro) [08:07:04] (03CR) 10David Caro: [C: 03+2] p:ceph::osd: get the os disks by size [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:07:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32785 and previous config saved to /var/cache/conftool/dbconfig/20220823-080710-root.json [08:07:15] (03PS5) 10David Caro: p:ceph::osd: get the os disks by size [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) [08:07:21] (03PS4) 10David Caro: ceph::osd: add new disks model to disable write caches for [puppet] - 10https://gerrit.wikimedia.org/r/824423 (https://phabricator.wikimedia.org/T314870) [08:07:37] the translate patch will take a while to merge, we can deploy the configuration change till then? 
[08:07:43] sure [08:08:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [08:08:17] (03CR) 10Hashar: [C: 03+2] Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [08:08:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [08:08:38] (03PS3) 10David Caro: wmcs.novafullstack: stop sending stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/814800 [08:09:06] (03Merged) 10jenkins-bot: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [08:09:51] abijeet: the config patch is on mwdebug1001 if you wanna verify [08:09:57] thanks! checking [08:15:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:15:21] PROBLEM - Disk space on ms-be2039 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2039&var-datasource=codfw+prometheus/ops [08:15:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [08:16:14] (03CR) 10Muehlenhoff: [C: 03+2] Disable Ganeti cluster rebalances temporarily [puppet] - 10https://gerrit.wikimedia.org/r/825678 (https://phabricator.wikimedia.org/T311686) (owner: 10Muehlenhoff) [08:16:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:16:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:16:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1105:3311 (T312972)', diff saved to https://phabricator.wikimedia.org/P32786 and previous config saved to /var/cache/conftool/dbconfig/20220823-081645-marostegui.json [08:16:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:16:50] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [08:17:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T312972)', diff saved to https://phabricator.wikimedia.org/P32787 and previous config saved to /var/cache/conftool/dbconfig/20220823-081706-marostegui.json [08:17:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:17:51] (03PS1) 10Majavah: P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/825685 [08:18:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312972)', diff saved to https://phabricator.wikimedia.org/P32788 and previous config saved to /var/cache/conftool/dbconfig/20220823-081813-marostegui.json [08:18:26] (03CR) 10CI reject: [V: 04-1] P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [08:19:12] (03PS2) 10Majavah: P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/825685 [08:20:00] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36888/console" [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [08:21:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825682 (owner: 10Muehlenhoff) [08:21:50] 
(03CR) 10Majavah: P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [08:22:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P32789 and previous config saved to /var/cache/conftool/dbconfig/20220823-082215-root.json [08:22:32] hashar, that looks ok. Are you an administrator on Meta-Wiki? If so can you change the content model of this page to translatable bundle? https://meta.wikimedia.org/w/index.php?title=User:APatro_(WMF)/TestMessageBundle-23Aug2022&action=info [08:23:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1162', diff saved to https://phabricator.wikimedia.org/P32790 and previous config saved to /var/cache/conftool/dbconfig/20220823-082336-root.json [08:24:16] (03Merged) 10jenkins-bot: Add declarations for TranslatablePage in extension.json [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825284 (https://phabricator.wikimedia.org/T315889) (owner: 10Abijeet Patro) [08:24:21] abijeet: maybe! 
checking [08:25:00] abijeet: well I don't know how to change a page model :) [08:25:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32792 and previous config saved to /var/cache/conftool/dbconfig/20220823-082515-root.json [08:25:30] hashar, thanks, the content model you should see (when connecting via 1001) is message bundle [08:25:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32793 and previous config saved to /var/cache/conftool/dbconfig/20220823-082545-root.json [08:25:51] hashar, Go here: https://meta.wikimedia.org/w/index.php?title=User:APatro_(WMF)/TestMessageBundle-23Aug2022&action=info and you should see a label: "Page content model" [08:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P32794 and previous config saved to /var/cache/conftool/dbconfig/20220823-082605-root.json [08:26:31] that should be a dropdown where you can select other content model [08:26:50] Page content model `wikitext` [08:26:57] but I can't edit it, therefore I am not admin there [08:26:58] :( [08:27:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:28:07] or some rights are missing maybe [08:28:18] I was not expecting administrator access to be required actually. I think we can leave the change in place though because I'm not seeing it cause any issues. [08:28:24] (03PS1) 10Marostegui: mariadb: Productionize db1189 [puppet] - 10https://gerrit.wikimedia.org/r/825706 (https://phabricator.wikimedia.org/T313569) [08:29:02] ah yeah I am a normal user on that wiki. 
Got them wiped back in 2008 as part of a routine cleanup [08:29:17] lets deploy it [08:29:23] thanks [08:29:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1189 [puppet] - 10https://gerrit.wikimedia.org/r/825706 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [08:30:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:30:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:31:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:31:15] (03PS1) 10Marostegui: install_server: Do not reimage db1187 [puppet] - 10https://gerrit.wikimedia.org/r/825707 (https://phabricator.wikimedia.org/T313569) [08:32:08] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1187 [puppet] - 10https://gerrit.wikimedia.org/r/825707 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [08:32:55] (03CR) 10Jbond: [C: 03+2] puppet_compiler: relocate to /srv/jenkins [puppet] - 10https://gerrit.wikimedia.org/r/825360 (https://phabricator.wikimedia.org/T309698) (owner: 10Hashar) [08:33:00] !log hashar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820869|Enable message bundle on MetaWiki for WikiLearn (T311587)]] (duration: 03m 27s) [08:33:04] T311587: WikiLearn: Integration checklist for MetaWiki - https://phabricator.wikimedia.org/T311587 [08:33:09] abijeet: done [08:33:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P32796 and previous config saved to /var/cache/conftool/dbconfig/20220823-083319-marostegui.json [08:33:26] hmm [08:33:35] and now I see Page content model Translatable message bundle [08:33:57] I asked Amir to change it :) [08:34:35] lovely [08:34:44] it shows some raw json as a result [08:35:19] yes, its a new feature that we are working on to allow translation of raw json content. 
[08:35:24] meanwhile the Translate change has merged [08:35:37] yup, we can go ahead and test that [08:36:55] it is on mwdebug1001 now [08:38:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36889/console" [puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [08:38:19] (03PS1) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) [08:40:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P32797 and previous config saved to /var/cache/conftool/dbconfig/20220823-084020-root.json [08:40:26] checking [08:40:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 2%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32798 and previous config saved to /var/cache/conftool/dbconfig/20220823-084050-root.json [08:41:42] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [08:41:55] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36890/console" [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [08:41:56] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [08:42:32] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36891/console" [puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [08:44:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2019.codfw.wmnet to cluster codfw and group B [08:44:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2019.codfw.wmnet to cluster codfw and group B [08:45:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [08:45:36] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:56] (03PS3) 10Btullis: Add an entry for dse-k8s-ctrl to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172) [08:47:21] (03PS2) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) [08:48:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P32799 and previous config saved to /var/cache/conftool/dbconfig/20220823-084826-marostegui.json [08:49:18] (03CR) 10Btullis: Add an entry for dse-k8s-ctrl to the service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [08:49:26] hashar, I'm still seeing the error this fix was supposed to address in the job queue but the server is a different one (mw1311, mw1335 etc). I'm assuming this should not be fixed once we release this fix? 
[08:49:59] I'm assuming this should be fixed once we release this fix?* [08:50:08] I imagine it's same as with twn, cannot test jobqueue with canary [08:50:08] (03CR) 10Jbond: [C: 03+2] raid_fact: Add new refactored raid fact [puppet] - 10https://gerrit.wikimedia.org/r/815287 (https://phabricator.wikimedia.org/T313312) (owner: 10Jbond) [08:50:30] (03PS2) 10Muehlenhoff: Absent libsnmp30 on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/825682 [08:50:45] that makes sense, so we can release this then [08:52:22] ohh for the job queue yes [08:52:23] :( [08:52:37] I will sync it [08:53:22] (03PS1) 10Majavah: Remove few usages of ge('stretch') [puppet] - 10https://gerrit.wikimedia.org/r/825713 [08:53:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) [08:54:32] (03CR) 10Jbond: "lgtm minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/825369 (owner: 10Muehlenhoff) [08:54:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [08:55:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32800 and previous config saved to /var/cache/conftool/dbconfig/20220823-085525-root.json [08:55:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32801 and previous config saved to /var/cache/conftool/dbconfig/20220823-085554-root.json [08:56:21] !log hashar@deploy1002 Synchronized php-1.39.0-wmf.25/extensions/Translate/extension.json: Backport: [[gerrit:825284|Add declarations for TranslatablePage in extension.json (T315889)]] (duration: 03m 39s) [08:56:24] T315889: Error: Class 'TranslatablePage' not found - https://phabricator.wikimedia.org/T315889 [08:56:28] abijeet: deployed [08:57:15] thanks [08:59:16] 
(03CR) 10Muehlenhoff: [C: 03+2] Absent libsnmp30 on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/825682 (owner: 10Muehlenhoff) [08:59:43] I see that the Translate patch worked as intended. Thanks again. [09:00:30] awesome [09:00:40] if any follow up is needed poke me [09:00:46] though I am going to run the mediawiki train now [09:00:55] sounds good [09:01:18] (03PS2) 10Muehlenhoff: Remove few usages of ge('stretch') [puppet] - 10https://gerrit.wikimedia.org/r/825713 (owner: 10Majavah) [09:03:31] (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825715 (https://phabricator.wikimedia.org/T314187) [09:03:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312972)', diff saved to https://phabricator.wikimedia.org/P32802 and previous config saved to /var/cache/conftool/dbconfig/20220823-090332-marostegui.json [09:03:33] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825715 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [09:03:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [09:03:37] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [09:03:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [09:03:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T312972)', diff saved to https://phabricator.wikimedia.org/P32803 and previous config saved to /var/cache/conftool/dbconfig/20220823-090353-marostegui.json [09:04:37] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825715 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [09:05:00] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312972)', diff saved to https://phabricator.wikimedia.org/P32804 and previous config saved to /var/cache/conftool/dbconfig/20220823-090500-marostegui.json [09:05:18] (03CR) 10JMeybohm: [C: 03+1] Add an entry for dse-k8s-ctrl to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [09:05:30] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. [09:05:36] !log hashar@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.26 refs T314187 [09:05:42] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [09:06:03] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/825713 (owner: 10Majavah) [09:06:05] (03CR) 10Btullis: [C: 03+2] Add an entry for dse-k8s-ctrl to the service catalog [puppet] - 10https://gerrit.wikimedia.org/r/825348 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [09:06:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:07:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:07:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:08:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32805 and previous config saved to /var/cache/conftool/dbconfig/20220823-091029-root.json [09:10:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 8%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32806 and previous config saved to 
/var/cache/conftool/dbconfig/20220823-091059-root.json [09:12:35] (03CR) 10JMeybohm: "Hey Jesse, may I lure you into another go review? 😊" [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/820713 (owner: 10JMeybohm) [09:13:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:13:45] (03Abandoned) 10JMeybohm: Install main version as istioctl plus bash completion [debs/istioctl] - 10https://gerrit.wikimedia.org/r/719040 (owner: 10JMeybohm) [09:14:04] (03PS2) 10Ayounsi: Add names to flow collectors [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) [09:14:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:14:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:15:08] (03CR) 10Ayounsi: "thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [09:15:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:15:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [09:16:51] (03PS6) 10Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) [09:16:56] (03CR) 10Btullis: Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:17:20] (03CR) 10Ayounsi: [C: 03+2] Add names to flow collectors [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [09:19:45] (03Merged) 10jenkins-bot: Add names to flow collectors [homer/public] - 10https://gerrit.wikimedia.org/r/822414 
(https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [09:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P32807 and previous config saved to /var/cache/conftool/dbconfig/20220823-092006-marostegui.json [09:20:30] PROBLEM - Check systemd state on db1117 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb.service,prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:28] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [09:21:48] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [09:22:48] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [09:22:52] RECOVERY - Check systemd state on db1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:10] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [09:23:24] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:24:57] 10SRE, 10Traffic, 10Patch-For-Review: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911 (10Vgutierrez) = Current Status = The current settings applied when `profile::trafficserver::backend::origin_coalescing` is set to `true` (default value) are: ` CONFIG proxy.... 
[09:25:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P32808 and previous config saved to /var/cache/conftool/dbconfig/20220823-092534-root.json [09:26:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32809 and previous config saved to /var/cache/conftool/dbconfig/20220823-092603-root.json [09:27:25] (03CR) 10Jbond: "LGTM, minor optimisation" [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [09:28:58] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10dcaro) I think that this might be the experiments that we have been doing with Magnum, ping @Andrew, @rook [09:29:18] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:29:50] (03CR) 10David Caro: [C: 03+2] ceph::osd: add new disks model to disable write caches for [puppet] - 10https://gerrit.wikimedia.org/r/824423 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [09:31:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:14] (03PS6) 10David Caro: wmcs: add ldap getent speed 
alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 [09:34:55] (03PS1) 10Ayounsi: We actually need both the name and IP here [homer/public] - 10https://gerrit.wikimedia.org/r/825719 (https://phabricator.wikimedia.org/T313805) [09:35:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P32810 and previous config saved to /var/cache/conftool/dbconfig/20220823-093512-marostegui.json [09:35:32] (03CR) 10CI reject: [V: 04-1] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [09:36:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/825719 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [09:37:01] (03CR) 10Ayounsi: [C: 03+2] We actually need both the name and IP here [homer/public] - 10https://gerrit.wikimedia.org/r/825719 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [09:38:42] (03Merged) 10jenkins-bot: We actually need both the name and IP here [homer/public] - 10https://gerrit.wikimedia.org/r/825719 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [09:39:34] (03CR) 10Vgutierrez: [C: 03+2] Incremental roll-out of query-sorting (5%) [puppet] - 10https://gerrit.wikimedia.org/r/825404 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [09:40:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:40:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32811 and previous config saved to /var/cache/conftool/dbconfig/20220823-094039-root.json [09:40:40] !log Incremental roll-out of query-sorting (5%) - T314868 [09:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:46] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [09:41:09] !log 
hashar@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.26 refs T314187 (duration: 35m 32s) [09:41:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 20%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32812 and previous config saved to /var/cache/conftool/dbconfig/20220823-094108-root.json [09:41:12] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [09:42:00] (03PS2) 10David Caro: openstack.galera: add nodecheck logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/809100 [09:42:53] !log add NAT rule for frdev1002 on pfw3-eqiad - T315579 [09:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:50] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36893/console" [puppet] - 10https://gerrit.wikimedia.org/r/809100 (owner: 10David Caro) [09:44:03] jbond: it looks like 7475cde50bbea5272dac9091682f1e90e8c0eeb3 added a duplicated key on pci_ids, 100010e2 and it's triggering a warning on puppet agent runs [09:44:31] the warning is pretty clear as well: /var/lib/puppet/lib/facter/raid.rb:9: warning: key "100010e2" is duplicated and overwritten on line 12 [09:45:28] (03PS1) 10Muehlenhoff: Update PCI ID list for new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825720 (https://phabricator.wikimedia.org/T313312) [09:46:18] that CR by moritzm seems to address the issue :) [09:47:01] vgutierrez: yeah, I had just been debugging this, but it took some time to get to the bottom of it [09:47:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:47:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:47:29] (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack.galera: add nodecheck logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/809100 (owner: 10David Caro) 
[09:49:05] !log hashar@deploy1002 Pruned MediaWiki: 1.39.0-wmf.23 (duration: 02m 20s) [09:49:40] (03CR) 10Aqu: "Hey Otto, I forget to `reply` sorry." [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [09:49:42] (03CR) 10David Caro: "Blocked on I700dce85d20277bc8270e4e29ce276adaebedfa3" [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [09:50:06] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825721 (https://phabricator.wikimedia.org/T314187) [09:50:08] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825721 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [09:50:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312972)', diff saved to https://phabricator.wikimedia.org/P32813 and previous config saved to /var/cache/conftool/dbconfig/20220823-095018-marostegui.json [09:50:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [09:50:23] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [09:50:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [09:50:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T312972)', diff saved to https://phabricator.wikimedia.org/P32814 and previous config saved to /var/cache/conftool/dbconfig/20220823-095039-marostegui.json [09:50:51] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dse-k8s-ctrl on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/dse-k8s-ctrl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:51:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1184 (T312972)', diff saved to https://phabricator.wikimedia.org/P32815 and previous config saved to /var/cache/conftool/dbconfig/20220823-095146-marostegui.json [09:52:25] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825721 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [09:52:32] (03Abandoned) 10Muehlenhoff: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [09:53:20] (03CR) 10Klausman: [C: 03+1] Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:53:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:54:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825720 (https://phabricator.wikimedia.org/T313312) (owner: 10Muehlenhoff) [09:54:47] (03CR) 10Muehlenhoff: [C: 03+2] Update PCI ID list for new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825720 (https://phabricator.wikimedia.org/T313312) (owner: 10Muehlenhoff) [09:55:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32816 and previous config saved to /var/cache/conftool/dbconfig/20220823-095543-root.json [09:56:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 30%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32817 and previous config saved to /var/cache/conftool/dbconfig/20220823-095613-root.json [09:56:20] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.26 refs T314187 [09:56:23] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [09:57:33] (03CR) 10Btullis: sre: port 
Zookeeper alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:58:00] (03CR) 10JMeybohm: [C: 03+1] Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:58:23] (03CR) 10Btullis: [C: 03+2] Add a new VIP for dse-k8s-ctrl.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/825329 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:58:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:58:32] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10jcrespo) [10:00:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add descriptions to BGP peers - https://phabricator.wikimedia.org/T313805 (10ayounsi) 05Open→03Resolved a:03ayounsi Fixed everywhere I could find any.
[10:00:59] (03PS1) 10FNegri: Add cloudcephosd1027 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/825722 (https://phabricator.wikimedia.org/T314870) [10:01:43] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:02:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:02:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:06:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P32818 and previous config saved to /var/cache/conftool/dbconfig/20220823-100652-marostegui.json [10:10:25] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dse-k8s-ctrl on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/dse-k8s-ctrl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:10:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32819 and previous config saved to /var/cache/conftool/dbconfig/20220823-101048-root.json [10:11:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 40%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32820 and previous config saved to /var/cache/conftool/dbconfig/20220823-101117-root.json [10:17:29] (03PS1) 10Marostegui: site.pp: Remove db1189 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/825723 (https://phabricator.wikimedia.org/T313569) [10:18:11] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db1189 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/825723 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [10:20:33] (03PS1) 10David Caro: 
openstack.wallaby: add missing domain_id parameter [puppet] - 10https://gerrit.wikimedia.org/r/825724 (https://phabricator.wikimedia.org/T315980) [10:21:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P32821 and previous config saved to /var/cache/conftool/dbconfig/20220823-102158-marostegui.json [10:22:08] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10rook) Yeah, this is associated with the testing we're doing with magnum. It's part of 185.15.57.16/29 which was assigned to codfw1dev in T313977 How does one docum... [10:24:19] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:26:11] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32822 and previous config saved to /var/cache/conftool/dbconfig/20220823-102622-root.json [10:26:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:26:49] (03CR) 10FNegri: [C: 03+1] "LGTM. 
This was caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/825380 -- I wonder if it broke other things apart from this " [puppet] - 10https://gerrit.wikimedia.org/r/825724 (https://phabricator.wikimedia.org/T315980) (owner: 10David Caro) [10:26:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P32823 and previous config saved to /var/cache/conftool/dbconfig/20220823-102657-root.json [10:27:01] (03CR) 10Filippo Giunchedi: sre: port Zookeeper alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:28:16] (03PS1) 10David Caro: openstack.xena: add missing domain_id parameter [puppet] - 10https://gerrit.wikimedia.org/r/825725 (https://phabricator.wikimedia.org/T315980) [10:29:16] (03CR) 10David Caro: [C: 03+2] openstack.wallaby: add missing domain_id parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825724 (https://phabricator.wikimedia.org/T315980) (owner: 10David Caro) [10:33:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:33:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T312972)', diff saved to https://phabricator.wikimedia.org/P32824 and previous config saved to /var/cache/conftool/dbconfig/20220823-103704-marostegui.json [10:37:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [10:37:09] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [10:37:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [10:37:21] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:37:23] 10ops-codfw, 10Discovery-Search: elastic2054 is down with memory error - https://phabricator.wikimedia.org/T315989 (10MoritzMuehlenhoff) [10:37:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:37:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312972)', diff saved to https://phabricator.wikimedia.org/P32825 and previous config saved to /var/cache/conftool/dbconfig/20220823-103742-marostegui.json [10:38:24] (03CR) 10Btullis: sre: port Kafka alerts from Icinga (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:38:35] (03PS4) 10Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) [10:38:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312972)', diff saved to https://phabricator.wikimedia.org/P32826 and previous config saved to /var/cache/conftool/dbconfig/20220823-103850-marostegui.json [10:39:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:41:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 60%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32827 and previous config saved to /var/cache/conftool/dbconfig/20220823-104126-root.json [10:42:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P32828 and previous config saved to /var/cache/conftool/dbconfig/20220823-104201-root.json [10:42:31] (03CR) 10Btullis: sre: port Zookeeper 
alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:45:31] (03PS3) 10Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825369 [10:46:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. [10:46:25] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:23] !log btullis@puppetmaster1001 conftool action : set/pooled=yes:weight=1; selector: cluster=dse-k8s,service=kubemaster [10:50:36] (03CR) 10Slyngshede: c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [10:51:52] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825722 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [10:52:45] (03PS1) 10Btullis: Enable the LVS realserver profile for dse-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) [10:53:42] (03PS3) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) [10:53:44] (03PS1) 10Vgutierrez: trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) [10:53:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P32829 and previous config saved to /var/cache/conftool/dbconfig/20220823-105356-marostegui.json [10:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on 
dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:54:25] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32830 and previous config saved to /var/cache/conftool/dbconfig/20220823-105634-root.json [10:57:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P32831 and previous config saved to /var/cache/conftool/dbconfig/20220823-105706-root.json [10:57:17] (03PS4) 10Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825369 [10:57:31] (03CR) 10Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825369 (owner: 10Muehlenhoff) [11:01:39] RECOVERY - Disk space on ms-be2039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2039&var-datasource=codfw+prometheus/ops [11:03:04] (03Restored) 10Slyngshede: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [11:04:02] (03PS4) 10Slyngshede: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [11:04:04] (03PS2) 10Slyngshede: c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga. 
[puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) [11:04:20] (03Abandoned) 10Slyngshede: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [11:07:05] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36894/console" [puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [11:07:42] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga. [puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [11:09:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P32832 and previous config saved to /var/cache/conftool/dbconfig/20220823-110902-marostegui.json [11:09:51] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:57] o/ I have a patch in operations/puppet to schedule a weekly run of 3 scripts; who would be best to reach out to to get it merged? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/811312 [11:11:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P32833 and previous config saved to /var/cache/conftool/dbconfig/20220823-111139-root.json [11:12:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P32834 and previous config saved to /var/cache/conftool/dbconfig/20220823-111210-root.json [11:12:17] (03PS1) 10Slyngshede: c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga. [puppet] - 10https://gerrit.wikimedia.org/r/825728 (https://phabricator.wikimedia.org/T297913) [11:13:16] (03Abandoned) 10Slyngshede: c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga. [puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [11:13:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36896/console" [puppet] - 10https://gerrit.wikimedia.org/r/825728 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [11:13:27] (03PS2) 10Btullis: Enable the LVS realserver profile for dse-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) [11:14:35] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36897/console" [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [11:16:49] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:19] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with 
UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:22:56] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10jcrespo) [11:24:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312972)', diff saved to https://phabricator.wikimedia.org/P32835 and previous config saved to /var/cache/conftool/dbconfig/20220823-112408-marostegui.json [11:24:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [11:24:14] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [11:24:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [11:24:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T312972)', diff saved to https://phabricator.wikimedia.org/P32836 and previous config saved to /var/cache/conftool/dbconfig/20220823-112430-marostegui.json [11:24:39] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:25:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312972)', diff saved to https://phabricator.wikimedia.org/P32837 and previous config saved to /var/cache/conftool/dbconfig/20220823-112537-marostegui.json [11:26:13] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P32838 and 
previous config saved to /var/cache/conftool/dbconfig/20220823-112715-root.json [11:30:53] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:07] (03CR) 10Majavah: scap: introduce bootstrapping mechanism specific to deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [11:38:43] (03PS1) 10Hnowlan: api-gateway: custom host overrides in discovery services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/825729 [11:40:29] (03CR) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P32839 and previous config saved to /var/cache/conftool/dbconfig/20220823-114043-marostegui.json [11:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P32840 and previous config saved to /var/cache/conftool/dbconfig/20220823-114220-root.json [11:43:32] (03CR) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [11:45:25] (03PS4) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) [11:45:27] (03PS5) 10Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) [11:48:11] (03PS1) 10Ori: Incremental roll-out of query-sorting (15%) [puppet] - 10https://gerrit.wikimedia.org/r/825730 (https://phabricator.wikimedia.org/T314868) 
[11:55:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P32841 and previous config saved to /var/cache/conftool/dbconfig/20220823-115549-marostegui.json [12:00:11] 10SRE, 10Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845 (10ayounsi) a:05ayounsi→03None [12:08:05] (03PS3) 10Majavah: P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/825685 [12:09:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36898/console" [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [12:09:33] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:10:52] (03CR) 10Btullis: [C: 03+1] sre: port Kafka alerts from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:10:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312972)', diff saved to https://phabricator.wikimedia.org/P32842 and previous config saved to /var/cache/conftool/dbconfig/20220823-121055-marostegui.json [12:10:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:11:01] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:11:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:11:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:11:26]
!log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:11:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:11:45] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539) (owner: 10Muehlenhoff) [12:11:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:11:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T312972)', diff saved to https://phabricator.wikimedia.org/P32843 and previous config saved to /var/cache/conftool/dbconfig/20220823-121159-marostegui.json [12:13:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312972)', diff saved to https://phabricator.wikimedia.org/P32844 and previous config saved to /var/cache/conftool/dbconfig/20220823-121305-marostegui.json [12:13:25] (03CR) 10Btullis: [C: 03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/811986 (https://phabricator.wikimedia.org/T309622) (owner: 10Ottomata) [12:18:08] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:18:55] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you all for the reviews! 
I'm going ahead with this for now and we can revisit in a followup review too" [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:18:58] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:28:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P32845 and previous config saved to /var/cache/conftool/dbconfig/20220823-122811-marostegui.json [12:28:29] (03PS1) 10David Caro: wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) [12:29:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825728 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [12:31:02] (03CR) 10RhinosF1: wmcs.openstack.quota_increase: allow all known quota types (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [12:32:51] (03CR) 10Vgutierrez: [C: 03+2] Incremental roll-out of query-sorting (15%) [puppet] - 10https://gerrit.wikimedia.org/r/825730 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [12:33:26] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga. 
[puppet] - 10https://gerrit.wikimedia.org/r/825728 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [12:33:36] !log Incremental roll-out of query-sorting (15%) - T314868 [12:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:40] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [12:35:05] (03CR) 10CI reject: [V: 04-1] wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [12:39:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2019.codfw.wmnet to cluster codfw and group B [12:40:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2019.codfw.wmnet to cluster codfw and group B [12:41:49] (03CR) 10David Caro: wmcs.openstack.quota_increase: allow all known quota types (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [12:42:49] (03PS2) 10David Caro: wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) [12:43:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P32846 and previous config saved to /var/cache/conftool/dbconfig/20220823-124317-marostegui.json [12:45:04] (03PS1) 10Majavah: raid: use modern nrpe defines [puppet] - 10https://gerrit.wikimedia.org/r/825740 [12:45:43] (03PS1) 10Filippo Giunchedi: sre: alert on appserver unavailability [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) [12:46:44] (03PS1) 10Filippo Giunchedi: mediawiki: stop checking per-appserver availability [puppet] - 10https://gerrit.wikimedia.org/r/825742 
(https://phabricator.wikimedia.org/T314118) [12:49:39] (03CR) 10CI reject: [V: 04-1] wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [12:54:12] (03PS2) 10Filippo Giunchedi: mediawiki: stop checking per-appserver availability [puppet] - 10https://gerrit.wikimedia.org/r/825742 (https://phabricator.wikimedia.org/T314118) [12:58:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312972)', diff saved to https://phabricator.wikimedia.org/P32847 and previous config saved to /var/cache/conftool/dbconfig/20220823-125824-marostegui.json [12:58:29] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:58:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [12:58:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [12:58:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 15 hosts with reason: Maintenance [12:59:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 15 hosts with reason: Maintenance [12:59:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [12:59:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T1300). nyaa~ [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T1300) [13:00:23] * urbanecm waves [13:01:02] (03PS3) 10David Caro: wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) [13:04:06] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Running docker containers in a non-production environment - https://phabricator.wikimedia.org/T275551 (10Ottomata) > will it be possible to consume e.g. events from kafka infra, or read/write to swift? Nopers :/ > Is this the recommended way for running co... [13:05:51] (03PS3) 10FNegri: ceph.bootstrapp_and_add: don't rely on sda/sdb being the os disks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824457 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:06:57] (03PS2) 10FNegri: Add cloudcephosd1027 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/825722 (https://phabricator.wikimedia.org/T314870) [13:07:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10bking) [13:09:06] (03CR) 10FNegri: [C: 03+2] Add cloudcephosd1027 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/825722 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [13:12:30] (03CR) 10FNegri: [C: 03+2] "LGTM" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824457 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:12:39] urbanecm, would you have time to deploy a config patch? [13:12:47] sure thing zabe [13:12:56] can you add it to the calendar? 
[13:13:14] yes [13:13:59] (03PS3) 10Zabe: Start writing to cuc_actor on s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824152 (https://phabricator.wikimedia.org/T233004) [13:14:02] urbanecm, done [13:14:37] (03CR) 10Urbanecm: [C: 03+2] Start writing to cuc_actor on s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824152 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [13:14:43] let's do it :) [13:15:23] (03Merged) 10jenkins-bot: Start writing to cuc_actor on s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824152 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [13:15:45] zabe: pulled to mwdebug1001 [13:16:48] urbanecm, did https://www.wikidata.org/w/index.php?title=User:Zabe/Test&oldid=1711236363 could you check cu_changes? [13:16:52] ah wait [13:16:52] sure [13:16:59] forgot to enable the extension [13:17:04] okay, waiting [13:17:57] okay, https://www.wikidata.org/w/index.php?title=User%3AZabe%2FTest&type=revision&diff=1711237252&oldid=1711236960 should be it [13:18:14] urbanecm, ^ [13:18:52] `cuc_actor: 3172610`, `select * from actor where actor_id=3172610;` says Zabe, so, let's go for it? [13:19:10] logstash's also empty [13:19:31] urbanecm, should be good then [13:19:33] syncing! [13:19:41] (03Merged) 10jenkins-bot: ceph.bootstrapp_and_add: don't rely on sda/sdb being the os disks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824457 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:20:10] zabe: just curious, what blocks the switch at s4? 
[13:21:06] cuc_actor does not yet exist on db1160 (s4 master), see T303603 [13:21:07] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:21:13] ah [13:21:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:22:25] (03PS3) 10Hnowlan: jobqueue: increase num_workers to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) [13:22:39] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36899/console" [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [13:23:07] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b3b9e0a976506eb96252d2180e03a055bd6cc68a: Start writing to cuc_actor on s8 (T233004) (duration: 03m 31s) [13:23:13] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [13:23:15] zabe: should be live. anything else? [13:23:23] no thanks [13:23:34] but actually there was T311611 [13:23:35] T311611: Switchover s4 master - https://phabricator.wikimedia.org/T311611 [13:24:00] so db1160 is no longer the master, maybe I can poke Amir to finish the schema change [13:24:02] i do see the column at s4 eqiad master (db1138) [13:24:08] db1160 misses it though [13:24:23] I am on an interview, but do I need to depool db1160? [13:24:45] marostegui: that column's not used now [13:24:49] no, there is nothing urgent [13:24:51] urbanecm: ok [13:24:52] thanks :* [13:24:55] Amir1: ^ [13:24:58] (he's out today) [13:26:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:27:53] I'm out sick today but if it's urgent. 
I can do something [13:29:13] We have s4 switchover soon so it might actually make things bad. I will double check. Please ping me tomorrow zabe [13:29:31] (03CR) 10JMeybohm: [C: 03+1] Basic blubber file for thumbor (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [13:29:38] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (10Dzahn) @RLazarus @joe Just saw this again in the history after a while. re: https://config-master.wikimedia.org/pybal/eqiad/api-https My suggestion was to set **mw1307 through m... [13:29:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:30:10] it's not urgent, but I would be happy if the schema change gets done before this switchover happens [13:30:12] will do [13:30:30] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551 (10fkaelin) [13:31:19] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) We had a brief discussion about this within Infra Foundations and the consensus is roughly the same, i.e. it doesn't appear the root cause of thes... [13:31:53] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551 (10fkaelin) > I wonder an even more useful title would be "Using docker in WMF production network outside of kubernetes", as this is the real issue. Goo... 
[13:34:30] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr2-drmrs - https://phabricator.wikimedia.org/T315645 (10Vgutierrez) ack, thanks for checking the issue guys :) [13:35:07] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr1-drmrs - https://phabricator.wikimedia.org/T315645 (10cmooney) [13:39:02] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Marostegui can you access those servers using serial console and provide me with the IP addresses. [13:44:39] (03Abandoned) 10Hnowlan: WIP: build docker images using blubber and pip dependencies [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/771416 (https://phabricator.wikimedia.org/T267327) (owner: 10Hnowlan) [13:45:24] (03PS1) 10Jgiannelos: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/825750 [13:46:38] (03PS5) 10Hnowlan: Basic blubber file for thumbor [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) [13:46:51] (03CR) 10Hnowlan: Basic blubber file for thumbor (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [13:49:16] (03PS1) 10Vgutierrez: Revert "ATS: force cache revalidation for 7 wikis" [puppet] - 10https://gerrit.wikimedia.org/r/825692 [13:51:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825369 (owner: 10Muehlenhoff) [13:51:08] (03PS2) 10Vgutierrez: Revert "ATS: force cache revalidation for 7 wikis" [puppet] - 10https://gerrit.wikimedia.org/r/825692 (https://phabricator.wikimedia.org/T274784) [13:52:18] (03CR) 10FNegri: [C: 03+1] openstack.xena: add missing domain_id parameter [puppet] - 10https://gerrit.wikimedia.org/r/825725 
(https://phabricator.wikimedia.org/T315980) (owner: 10David Caro) [13:53:06] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/825750 (owner: 10Jgiannelos) [13:54:20] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10Krinkle) [13:56:09] 10SRE, 10Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr1-drmrs - https://phabricator.wikimedia.org/T315645 (10ayounsi) 05Open→03Stalled [13:56:17] (03CR) 10Vgutierrez: [C: 03+2] Revert "ATS: force cache revalidation for 7 wikis" [puppet] - 10https://gerrit.wikimedia.org/r/825692 (https://phabricator.wikimedia.org/T274784) (owner: 10Vgutierrez) [13:56:40] (03Merged) 10jenkins-bot: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/825750 (owner: 10Jgiannelos) [13:57:16] (03PS1) 10Jdlrobson: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) [13:58:55] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:59:01] (03PS1) 10Dzahn: scap/dsh: remove parsoid service, replaced by parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) [13:59:28] (03PS2) 10Ayounsi: junos_set_interface_config: fix logic error [cookbooks] - 10https://gerrit.wikimedia.org/r/821688 [13:59:30] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:00:58] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:01:53] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:02:32] !log jgiannelos@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mobileapps: apply [14:03:24] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:04:32] (03CR) 10Hnowlan: [C: 03+2] Basic blubber file for thumbor [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [14:06:22] (03Merged) 10jenkins-bot: Basic blubber file for thumbor [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [14:06:50] (03CR) 10Ottomata: [C: 03+1] Puppetize spark3 installation and configs using conda-analytics env V2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [14:06:52] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:07:23] (03PS1) 10Clément Goubert: admin: set shell to undef if user is removed [puppet] - 10https://gerrit.wikimedia.org/r/825755 [14:08:35] (03CR) 10Ottomata: [C: 03+1] "We can eventually install on the whole test cluster, I just want to do it in more than one patch. We do a hacky install on one node first" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [14:08:39] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36900/console" [puppet] - 10https://gerrit.wikimedia.org/r/825755 (owner: 10Clément Goubert) [14:11:17] (03CR) 10Clément Goubert: [V: 03+1] "Unbreaking the required shell install created a dep loop, can you please check this out since it's related to I73e749f6390 ?" [puppet] - 10https://gerrit.wikimedia.org/r/825755 (owner: 10Clément Goubert) [14:11:28] (03CR) 10Vgutierrez: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [14:11:54] (03CR) 10BCornwall: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [14:12:02] (03CR) 10CI reject: [V: 04-1] Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [14:13:35] (03CR) 10Herron: [C: 03+1] netmon: Configure Logrotate for LibreNMS logs [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [14:13:37] (03CR) 10Ottomata: "OH i had a comment but didn't hit reply! Sorry!" [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:14:12] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36901/console" [puppet] - 10https://gerrit.wikimedia.org/r/825755 (owner: 10Clément Goubert) [14:14:14] 10SRE, 10Performance-Team, 10Platform Engineering, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10Krinkle) [14:14:26] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:14:27] 10SRE, 10Performance-Team, 10Platform Engineering, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10Krinkle) [14:14:42] 10SRE, 10MediaWiki-General, 
10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [14:15:30] 10SRE, 10observability, 10Sustainability (Incident Followup), 10User-Joe, 10User-jijiki: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169 (10Krinkle) To clarify, this task and the linked incident, are about the `rdb*` hosts. These are known to MW as `redis_lock` and in monitori... [14:15:33] 10SRE, 10Citoid, 10serviceops, 10Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689 (10akosiaris) 05Open→03Resolved a:03akosiaris This has been done in https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/774848 and overall seems to work fine (as... [14:15:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) db1185: 10.64.0.108 This one seems to have link ok so maybe it is not the cable?: ` root@db1185:~# mii-tool eno1 eno1: negotiated 1000... [14:16:31] 10SRE, 10observability, 10Sustainability (Incident Followup), 10User-Joe, 10User-jijiki: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (10Krinkle) [14:16:42] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:16:53] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10akosiaris) Hi @Ottomata, @JArguello-WMF /me is back. Any updates on this one (even if just a rough timeline) ? Anything we can help... [14:17:31] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10akosiaris) @JMeybohm anything left to do here? [14:20:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1160', diff saved to https://phabricator.wikimedia.org/P32848 and previous config saved to /var/cache/conftool/dbconfig/20220823-142011-root.json [14:21:08] !log Run schema change on db1160 T303603 [14:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:12] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:21:35] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) in Netbox we have db1186 using 10.64.0.108 db1188 using 10.64.16.238 10.64.0.142 in Netbox is not assign to any hosts so there is IP ad... [14:22:14] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10JMeybohm) >>! In T277849#8178377, @akosiaris wrote: > @JMeybohm anything left to do here? Yeah. We did not do that during the helm3 migration. Maybe i... [14:24:06] (03CR) 10Andrew Bogott: "ty!" 
[puppet] - 10https://gerrit.wikimedia.org/r/825724 (https://phabricator.wikimedia.org/T315980) (owner: 10David Caro) [14:24:39] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) db1186 is using in Netbox the IP assigned to db1185 `` 10.64.0.108/22 Global Active — — db1186.eqiad.wmnet [14:26:51] (03CR) 10David Caro: [C: 03+2] openstack.xena: add missing domain_id parameter [puppet] - 10https://gerrit.wikimedia.org/r/825725 (https://phabricator.wikimedia.org/T315980) (owner: 10David Caro) [14:26:58] jouncebot now [14:26:58] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [14:27:13] (03PS3) 10Krinkle: redis: Remove references to nutcracker and redis_sessions cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824734 (https://phabricator.wikimedia.org/T267581) [14:27:20] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:32] (03CR) 10Krinkle: [C: 03+2] redis: Remove references to nutcracker and redis_sessions cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824734 (https://phabricator.wikimedia.org/T267581) (owner: 10Krinkle) [14:28:17] (03Merged) 10jenkins-bot: redis: Remove references to nutcracker and redis_sessions cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824734 (https://phabricator.wikimedia.org/T267581) (owner: 10Krinkle) [14:28:22] zabe urbanecm Amir1 I am deploying that change on db1160, will close the ticket when done (shouldn't take long) [14:28:35] thanks marostegui! [14:29:15] nice, we can start writing on s4 then. Thanks! 
[14:30:31] (03PS1) 10Zabe: Start writing to cuc_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825760 (https://phabricator.wikimedia.org/T233004) [14:30:42] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:31:10] zabe: if you have some time after the schema change is deployed, happy to deploy the config patch too. [14:31:49] ok :) [14:31:58] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:00] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:33:48] zabe urbanecm Amir1 done [14:33:55] that was quick, thanks [14:34:28] thanks [14:34:38] urbanecm, wanna do the deploy? 
[14:34:43] (03CR) 10Urbanecm: [C: 03+2] Start writing to cuc_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825760 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:34:48] yep yep [14:34:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 5%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32849 and previous config saved to /var/cache/conftool/dbconfig/20220823-143450-root.json [14:35:32] (03Merged) 10jenkins-bot: Start writing to cuc_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825760 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:35:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:35:48] zabe: it's at mwdebug1001 now [14:37:55] urbanecm, created https://commons.wikimedia.org/wiki/User:Zabe/Test without problems, nothing in logstash. Could you check cuc_actor as usual? [14:38:00] sure, doing [14:38:48] zabe: cuc_actor LGTM [14:38:52] so, let's sync? [14:39:09] yep [14:39:17] !log krinkle@deploy1002 Synchronized wmf-config/redis.php: Ib9947993bd5710a4 (duration: 03m 47s) [14:39:26] doing [14:40:31] * Krinkle revokes deploy handle [14:40:49] Krinkle: sorry, i didn't see your +2 in config [14:41:33] ive got more but it's not urgent, go ahead [14:42:43] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bcef1d56ef7c43e82121c7e70fae58784203e33d: Start writing to cuc_actor everywhere (T233004) (duration: 03m 18s) [14:42:48] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:42:55] zabe: and we're writing everywhere [14:43:04] nice! [14:43:09] thanks for your help :) [14:43:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:43:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:43:15] Krinkle: over to you :). 
[14:43:32] thx, I'll give it a few minutes while I prep the next [14:43:38] (03PS2) 10Krinkle: Remove references to now-empty redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824736 (https://phabricator.wikimedia.org/T267581) [14:43:42] (03PS2) 10Krinkle: redis: Remove now-empty and unreferenced redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824737 (https://phabricator.wikimedia.org/T267581) [14:43:53] * Krinkle takes global lock [14:46:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:47:30] PROBLEM - Check systemd state on mw2387 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:42] (03CR) 10Vgutierrez: "Please add a test for this on modules/varnish/files/tests/text/02-frontend-headers.vtc" [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [14:48:20] !log installing libtirpc security updates [14:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:34] (03CR) 10Cwhite: sre: port Kafka alerts from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:49:49] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:49:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32850 and previous config saved to /var/cache/conftool/dbconfig/20220823-144954-root.json [14:51:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:53:48] * Krinkle unlocks global lock [14:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:54:41] (03PS1) 10Muehlenhoff: Remove access for eyener [puppet] - 10https://gerrit.wikimedia.org/r/825763 [14:54:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-dcaro, 10cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (10dcaro) [14:55:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:32] Hey all - going to try a quick sec patch deploy for T307278. [14:56:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:56:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:57:06] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for eyener [puppet] - 10https://gerrit.wikimedia.org/r/825763 (owner: 10Muehlenhoff) [14:57:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:58:00] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:58:21] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10nskaggs) See also https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/notes/Service_predictions_for_cross_realm_situation, {T27... 
[15:00:24] jouncebot: now [15:00:25] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [15:01:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:29] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Effeietsanders out of all services on: 774 hosts [15:04:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Effeietsanders out of all services on: 774 hosts [15:04:23] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) a:05Krinkle→03None [15:04:24] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Effeietsanders out of all services on: 1238 hosts [15:04:45] mutante: sec deploy in progress [15:04:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Effeietsanders out of all services on: 1238 hosts [15:04:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32851 and previous config saved to /var/cache/conftool/dbconfig/20220823-150459-root.json [15:05:13] Krinkle: ok, I am waiting. What I want is to restart gerrit service [15:05:28] so a couple seconds of no merges [15:05:34] that's probably fine, these are not going through gerrit [15:05:48] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Erin Yener out of all services on: 1238 hosts [15:05:49] ok, thanks. 
then I'll just do it [15:06:16] !log gerrit - service restart - T315942 - added sshd.enableDeprecatedKexAlgorithms = true [15:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:20] T315942: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 [15:06:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Erin Yener out of all services on: 1238 hosts [15:06:33] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Erin Yener out of all services on: 774 hosts [15:07:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Erin Yener out of all services on: 774 hosts [15:08:19] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dave Pifke out of all services on: 1238 hosts [15:08:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dave Pifke out of all services on: 1238 hosts [15:09:07] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dave Pifke out of all services on: 774 hosts [15:09:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dave Pifke out of all services on: 774 hosts [15:14:36] (03CR) 10Jdlrobson: [C: 04-1] Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [15:18:02] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Marostegui db1186 and db1188 are now back online db1185 is still showing that the ling is down so i will have @Jclark-ctr check the cable... 
[15:20:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32852 and previous config saved to /var/cache/conftool/dbconfig/20220823-152003-root.json [15:21:03] (03PS1) 10Vlad.shapik: Revert setting expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/825788 (https://phabricator.wikimedia.org/T252719) [15:25:37] !log Deployed security patch for T307278 to wmf.26 [15:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:17] (03PS1) 10Clément Goubert: rsync: unbreak header conf with no fragment in rsync::server [puppet] - 10https://gerrit.wikimedia.org/r/825793 [15:30:17] (03CR) 10Vlad.shapik: Set expiry headers on thumbnails (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [15:30:31] (03PS2) 10Filippo Giunchedi: sre: alert on appserver unavailability [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) [15:30:40] !log Deployed security patch for T307278 to wmf.25 [15:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:42] * Krinkle takes scap lock again [15:34:53] (03CR) 10Krinkle: [C: 03+2] Remove references to now-empty redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824736 (https://phabricator.wikimedia.org/T267581) (owner: 10Krinkle) [15:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32853 and previous config saved to /var/cache/conftool/dbconfig/20220823-153508-root.json [15:35:38] (03Merged) 10jenkins-bot: Remove references to now-empty redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824736 (https://phabricator.wikimedia.org/T267581) (owner: 10Krinkle) 
[15:36:53] !log gerrit2002 - service restart [15:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:05] (03PS1) 10Krinkle: noc: Remove left-over redis.php.txt link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825795 [15:37:12] (03CR) 10Krinkle: [C: 03+2] noc: Remove left-over redis.php.txt link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825795 (owner: 10Krinkle) [15:38:01] (03Merged) 10jenkins-bot: noc: Remove left-over redis.php.txt link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825795 (owner: 10Krinkle) [15:40:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:40:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:41:25] !log gerrit - service restart [15:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:42:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:43:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:43:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:44:14] (03CR) 10Hnowlan: [C: 03+1] Revert setting expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/825788 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [15:44:26] !log krinkle@deploy1002 Synchronized wmf-config/: I1c5b0597817eb02 (duration: 03m 25s) [15:44:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:44:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:45:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:45:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:47:15] (03CR) 10Krinkle: [C: 03+2] redis: Remove now-empty and unreferenced redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824737 (https://phabricator.wikimedia.org/T267581) (owner: 10Krinkle) [15:48:16] (03Merged) 10jenkins-bot: redis: Remove now-empty and unreferenced redis.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824737 (https://phabricator.wikimedia.org/T267581) (owner: 10Krinkle) [15:49:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:50:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P32854 and previous config saved to /var/cache/conftool/dbconfig/20220823-155013-root.json [15:50:17] (03Abandoned) 10Jdlrobson: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [15:52:44] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:54:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:55:07] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] DONE helmfile.d/services/mwdebug: apply [16:00:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:01:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:01:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:01:20] (03PS1) 10Ladsgroup: SpecialRecentChangesLinked: Pass query builder instead of SQL [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825846 [16:01:35] (03PS1) 10Ladsgroup: SpecialRecentChangesLinked: Pass query builder instead of SQL [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825847 [16:02:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:03:54] !log krinkle@deploy1002 Synchronized wmf-config/: Ifd90aedd7c517481f9 (duration: 03m 18s) [16:07:16] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:33] jouncebot: nowandnext [16:21:33] For the next 0 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T1600) [16:21:33] In 1 hour(s) and 38 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T1800) [16:21:41] cool [16:21:50] (03CR) 10Ladsgroup: [C: 03+2] SpecialRecentChangesLinked: Pass query builder instead of SQL [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825846 (owner: 10Ladsgroup) [16:21:53] (03CR) 10Ladsgroup: [C: 03+2] SpecialRecentChangesLinked: Pass query builder instead of SQL [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825847 (owner: 10Ladsgroup) [16:22:04] (03PS1) 10Krinkle: [WIP] Remove test-wikipedia-icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825829 [16:27:54] (03PS1) 
10Ladsgroup: rdbms: Switch to getConnectionInternal() in getPrimaryPos() [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825848 [16:27:58] (03CR) 10Ladsgroup: [C: 03+2] rdbms: Switch to getConnectionInternal() in getPrimaryPos() [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825848 (owner: 10Ladsgroup) [16:29:30] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10dcausse) The 3 tasks above should be the followups of this incident. The root cause of the incident is I think a mix of the poor `swift` client... [16:29:54] 10SRE, 10Performance-Team, 10serviceops: Clean up testwiki experiments (Aug 2022) - https://phabricator.wikimedia.org/T314750 (10Krinkle) [16:30:54] (03CR) 10Matthias Mullie: "Any other concerns? Can we move this forward & get this merged please?" [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [16:31:01] (03PS1) 10Krinkle: Undeploy ShortUrl extension from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825833 (https://phabricator.wikimedia.org/T314750) [16:31:14] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (10RLazarus) That sounds right to me; it would give us the same distribution as codfw, which is probably as much work as we need to do on this. I don't think it's worth investing time... 
[16:33:41] (03PS1) 10Krinkle: Disable wgCiteResponsiveReferences on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825834 (https://phabricator.wikimedia.org/T314750) [16:37:20] (03CR) 10Andrew Bogott: [C: 03+1] "seems good :)" [puppet] - 10https://gerrit.wikimedia.org/r/813826 (https://phabricator.wikimedia.org/T313006) (owner: 10David Caro) [16:39:45] (03Merged) 10jenkins-bot: SpecialRecentChangesLinked: Pass query builder instead of SQL [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825846 (owner: 10Ladsgroup) [16:45:20] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [16:46:03] (03Merged) 10jenkins-bot: SpecialRecentChangesLinked: Pass query builder instead of SQL [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825847 (owner: 10Ladsgroup) [16:47:23] (03Merged) 10jenkins-bot: rdbms: Switch to getConnectionInternal() in getPrimaryPos() [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825848 (owner: 10Ladsgroup) [16:47:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:48:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:48:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:49:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:50:11] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) [16:50:22] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. 
- https://phabricator.wikimedia.org/T316031 (10bking) [16:50:35] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) [16:53:29] (03PS1) 10Krinkle: Enable wgKartographerStaticMapframe on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) [16:54:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:58:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:58:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:59:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:00:15] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.26/includes/specials/SpecialRecentChangesLinked.php: Backport: [[gerrit:825847|SpecialRecentChangesLinked: Pass query builder instead of SQL]] (duration: 03m 34s) [17:03:14] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:04] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:21] (03CR) 10David Caro: "I'm trying to run the test by using:" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:16:44] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:32] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.25/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Backport: [[gerrit:825848|rdbms: Switch 
to getConnectionInternal() in getPrimaryPos()]] (duration: 03m 27s) [17:19:02] (03PS1) 10Jdlrobson: Clean up main menu selectors [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825849 [17:20:37] 10SRE-swift-storage, 10Infrastructure-Foundations: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066 (10Clement_Goubert) I'm having the same issue with thanos-fe[1-2]00[1-2] servers. There's a combination of issues at play here from what I ca... [17:21:29] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.25/includes/specials/SpecialRecentChangesLinked.php: Backport: [[gerrit:825846|SpecialRecentChangesLinked: Pass query builder instead of SQL]] (duration: 03m 32s) [17:21:40] (03Restored) 10Jdlrobson: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [17:21:58] (03PS2) 10Jdlrobson: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) [17:30:35] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) Per conversation with @dcausse , we need to keep all objects within the T314835 (pseudo) folder in t... [17:33:07] !log 'bking@cumin starting thanos-swift cleanup for wdqs T316031' [17:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:11] T316031: Clean up the rdf-streaming-updater-codfw container from thanos-swift. 
- https://phabricator.wikimedia.org/T316031 [17:33:49] (03CR) 10Cwhite: [C: 03+2] tcpircbot: send !log events to log stream [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [17:35:16] (03PS1) 10Hashar: Revert "Gerrit v3.4.5 and rebuild plugins" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) [17:35:56] I am going to downgrade Gerrit after mutante investigation. The replication to gerrit-replica.wikimedia.org stopped working since the 3.4.4 to 3.4.5 upgrade :-\ [17:36:47] (03CR) 10Dzahn: [C: 03+1] "let's try this if even just to confirm the kex algo error is gone and replication works again" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:37:04] !log restart tcpircbot T257861 [17:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:10] T257861: Pipe SAL entries into Logstash - https://phabricator.wikimedia.org/T257861 [17:37:26] (03CR) 10Hashar: [C: 03+2] Revert "Gerrit v3.4.5 and rebuild plugins" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:37:47] (03Merged) 10jenkins-bot: Revert "Gerrit v3.4.5 and rebuild plugins" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:38:24] (03CR) 10CI reject: [V: 04-1] Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [17:39:13] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e11e6a7]: Revert Gerrit from 3.4.5 to 3.4.4 # T315942 [17:39:17] T315942: Diffusion mirrors of Gerrit repos not showing commits made since August 17 
- https://phabricator.wikimedia.org/T315942 [17:39:17] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e11e6a7]: Revert Gerrit from 3.4.5 to 3.4.4 # T315942 (duration: 00m 04s) [17:39:47] !log hashar@deploy1002 Started deploy [gerrit/gerrit@cb7edfb]: Revert Gerrit from 3.4.5 to 3.4.4 # T315942 [17:39:55] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@cb7edfb]: Revert Gerrit from 3.4.5 to 3.4.4 # T315942 (duration: 00m 08s) [17:41:10] !log Stopping Gerrit [17:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:06] 10SRE, 10Data-Engineering, 10Traffic: Spike: Investigate creating robust alerts to notify that caching nodes are not sending traffic data - https://phabricator.wikimedia.org/T304651 (10Milimetric) 05Open→03Declined I'm declining this in favor of other work Ben is doing to improve the alert. I think this... [17:43:54] Gerrit is back to 3.4.4 [17:44:25] hashar: thanks, and replication is indeed back it seems [17:44:43] Push to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/services/parsoid.git... [17:44:52] great [17:44:59] sorry for the original dismissing of rolling back gerrit [17:45:16] I missed it was a patch level rollback and that would have been trivial to do [17:45:34] alright, good [17:46:32] (03CR) 10Dzahn: [C: 03+1] "it did indeed fix replication:" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:46:56] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) Swiftly is running in a tmux window on cumin1001. Command run: `swiftly --cache-auth --eventlet --c... [18:00:04] hashar and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T1800). [18:02:37] (03PS2) 10Cwhite: logstash: add tcpircbot logging tests [puppet] - 10https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) [18:05:36] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) on cr2 interface setup complete ` papaul@re0.cr2-eqiad# run show interfaces terse | match xe-1/1/* xe-1/1/0:0 down down xe-1/1/0:1... [18:06:58] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) @ayounsi everything it ready on the routers to start moving the links. Sorry i am late on this had to finished with the PDU's maintenance. [18:10:55] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10dcausse) [18:15:46] !log Restarting CI Jenkins [18:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:04] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:42:31] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:47:18] (03CR) 10Bernard Wang: [C: 03+1] "This LGTM! I tested manually and the extra spacing is removed and I couldn't find any layout issues. 
Pixel mostly had link color errors an" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [18:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:06:58] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:59] (03PS1) 10Ryan Kemper: elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) [19:12:33] (03CR) 10CI reject: [V: 04-1] elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [19:21:44] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:21:52] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 3 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 7, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [19:21:52] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 70.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:22:46] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fe6d6fd3280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech [19:22:46] ia.org/wiki/Search%23Administration [19:27:25] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10JAnstee_WMF) @bcampbell - It seems the last action still has not resolved this issue. Is there a next step that we should be trying or following back... [19:28:26] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:31:08] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:34:08] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T316044 (10jdfraine) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks in Wi... [19:34:45] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T315604 [19:34:50] T315604: Upgrade relforge cluster to 7.10.2 - https://phabricator.wikimedia.org/T315604 [19:34:59] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T315604 [19:46:16] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:50] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T316044 (10RhinosF1) [19:49:58] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T316044 (10RhinosF1) 05Open→03Stalled Hi @JDFraine: I see you've added a completed checklist but it doesn't look like this is done. Can you please follow the instructions on... 
[19:57:04] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_ [19:57:04] , task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [19:57:59] (03CR) 10Jdrewniak: "recheck" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220823T2000). [20:00:05] koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] hi [20:00:32] (03PS1) 10Cwhite: logstash: duplicate sal logs to Loki [puppet] - 10https://gerrit.wikimedia.org/r/825880 (https://phabricator.wikimedia.org/T257861) [20:00:59] hello, I have a last-minute addition to the backport window [20:01:08] hello, i can deploy today [20:01:13] jan_drewniak: can you please add it to calendar? 
[20:01:33] (03PS2) 10Urbanecm: trwikiquote: Enable block feature of abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825334 (https://phabricator.wikimedia.org/T315736) (owner: 10Stang) [20:01:46] (03CR) 10Urbanecm: [C: 03+2] trwikiquote: Enable block feature of abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825334 (https://phabricator.wikimedia.org/T315736) (owner: 10Stang) [20:02:17] urbanecm: yup just did, CI is still verifying I think [20:02:20] thanks [20:02:32] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/825752 looks to fail CI [20:02:33] (03Merged) 10jenkins-bot: trwikiquote: Enable block feature of abusefilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825334 (https://phabricator.wikimedia.org/T315736) (owner: 10Stang) [20:02:39] and it also depends on https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/825849/1 [20:02:44] can you check the CI please jan_drewniak? [20:03:12] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 251, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:28] koi: not sure if it's testable, but pulled to mwdebug1001 [20:04:30] urbanecm: it's testable, and LGTM [20:04:31] urbanecm: yeah I think my patch still needs editing... whitespace issues. I'll be back in a sec [20:04:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:04:58] koi: thanks, syncing [20:05:00] jan_drewniak: ack. 
[20:05:47] (03PS3) 10Jdrewniak: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [20:06:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:07:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:16] !log urbanecm@deploy1002 Synchronized wmf-config/abusefilter.php: 8fb3575f054c7faa2f5415658c9f169bc6f6e227: trwikiquote: Enable block feature of abusefilter (T315736) (duration: 02m 57s) [20:08:20] T315736: Enable the block feature of AbuseFilter on trwikiquote - https://phabricator.wikimedia.org/T315736 [20:08:23] koi: should be live [20:08:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:17:36] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 146, active_shards: 292, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [20:17:36] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:17:58] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig 
[20:17:58] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:19:04] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 146, active_shards: 292, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [20:19:04] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:23:58] urbanecm: just fyi, my patches are finally ready. [20:24:07] thanks for the ping [20:24:11] checking now [20:24:23] it's marked as depends-on https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/825849/1 [20:24:28] do both patches need to be deployed jan_drewniak? [20:25:20] urbanecm: that's correct. (I just added both to the schedule) [20:25:27] okay [20:25:42] (03CR) 10Urbanecm: [C: 03+2] Clean up main menu selectors [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825849 (owner: 10Jdlrobson) [20:25:48] i'll let you know once they can be tested [20:25:50] (03CR) 10Urbanecm: [C: 03+2] Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [20:25:51] (03PS1) 10Ryan Kemper: elastic: es7 removed bulk threadpool [puppet] - 10https://gerrit.wikimedia.org/r/825883 (https://phabricator.wikimedia.org/T308676) [20:26:28] jan_drewniak: should https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/825401 be deployed to wmf.26 as well? 
[20:26:43] (03CR) 10Dzahn: mediawiki: Split updateSpecialPages.php job to be per-shard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [20:26:48] (03PS9) 10Dzahn: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [20:27:23] urbanecm: shoot yeah I guess both of them should [20:27:34] mutante: if you have time to push that forward, please do! I won't have time till mid-September [20:27:39] the parent made the wmf.26 cut [20:27:55] (03PS2) 10Ryan Kemper: elastic: es7 removed bulk threadpool [puppet] - 10https://gerrit.wikimedia.org/r/825883 (https://phabricator.wikimedia.org/T308676) [20:28:01] * urbanecm likes the "Included in" button in Gerrit [20:28:10] urbanecm: sorry, yeah I just saw that now [20:28:17] no problem [20:28:17] (03PS1) 10Urbanecm: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825889 (https://phabricator.wikimedia.org/T315595) [20:28:25] (03CR) 10Urbanecm: [C: 03+2] Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825889 (https://phabricator.wikimedia.org/T315595) (owner: 10Urbanecm) [20:28:33] legoktm: since I just randomly ran into that and saw that TODO..i think i'll JFDI [20:28:41] :D [20:29:18] (03PS2) 10Jdrewniak: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825889 (https://phabricator.wikimedia.org/T315595) (owner: 10Urbanecm) [20:29:41] (03CR) 10Urbanecm: [C: 03+2] Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825889 (https://phabricator.wikimedia.org/T315595) (owner: 10Urbanecm) [20:29:43] re-+2'ing [20:30:20] (03CR) 10Dzahn: [C: 03+2] "I just ran into this unrelatedly , ended up at 
https://phabricator.wikimedia.org/T307314#8179490 and then was linked here. based on all th" [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [20:32:24] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:32:28] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:18] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:48:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:32] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.808 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:49:50] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:51:10] mw1445 is overloaded [20:51:19] with videoscaling :/ [20:51:29] mutante: rzl: can you assist with that please? 
[20:53:56] urbanecm: I don't think there is much to be done besides letting it finish the job, that is jobrunner/videoscaler, it's not appserver/api [20:54:16] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:54:32] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.570 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:55:06] mutante: okay, thanks. trying to make sure it's not a reason to worry when doing MW deployments :). [20:55:57] urbanecm: thanks for checking. but in this case, yea, i confirm the server is up, it's doing things with ffmpeg and I think you don't have to worry about the deployment [20:56:05] thanks! [20:56:30] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.698 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:56:32] I am going to deploy a change to cronjobs on mwmaint [20:56:38] for the update_special_pages [20:56:51] I mean timers of course. 
[20:57:24] (03Merged) 10jenkins-bot: Clean up main menu selectors [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825849 (owner: 10Jdlrobson) [20:57:33] (03Merged) 10jenkins-bot: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825752 (https://phabricator.wikimedia.org/T315595) (owner: 10Jdlrobson) [20:57:34] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:57:41] (03Merged) 10jenkins-bot: Remove grid row gap in favor of margins [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825889 (https://phabricator.wikimedia.org/T315595) (owner: 10Urbanecm) [20:57:48] here we go :) [20:58:14] 👍 [20:58:38] jan_drewniak: pulled to mwdebug1001, can you check please? [20:59:26] (03CR) 10Dzahn: [C: 03+2] "first ran puppet on mwmaint2002 - got unexpected: Unknown resource type: 'profile::mediawiki::sharded_periodic_job'" [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [21:00:02] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10bcampbell) Hey @JAnstee_WMF, I've been working with Tanja more on this on our Zendesk ticket. The next step I proposed was to set up a meeting with u... 
[21:00:11] (03CR) 10Dzahn: [C: 03+2] "had not realized https://gerrit.wikimedia.org/r/c/operations/puppet/+/804800/2 was still WIP" [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [21:01:08] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:02:22] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:03:18] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:03:58] urbanecm: hey, I'm sorry for all the trouble here, but I'm inspecting the patch now, and it looks like it actually makes the problem worse (for me at least) so I'm not feeling confident about syncing it. [21:04:14] no worries jan_drewniak. so, let's revert all three patches? [21:04:23] (03CR) 10Dzahn: "I already merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/804788 which uses shared_periodic_job. Should have done this first. " [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:04:33] I gotta sync with my team on what's going on with it... yeah. I guess revert. 
[21:04:34] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.512 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:37] doing [21:04:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:04:59] (03PS1) 10Urbanecm: Revert "Clean up main menu selectors" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825890 [21:05:15] (03PS1) 10Urbanecm: Revert "Remove grid row gap in favor of margins" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825891 [21:05:20] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Remove grid row gap in favor of margins" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825891 (owner: 10Urbanecm) [21:05:28] (03PS2) 10Urbanecm: Revert "Clean up main menu selectors" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825890 [21:05:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:05:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:05:40] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "revert" [skins/Vector] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825890 (owner: 10Urbanecm) [21:05:57] (03PS1) 10Urbanecm: Revert "Remove grid row gap in favor of margins" [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825892 [21:06:03] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Remove grid row gap in favor of margins" [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825892 (owner: 10Urbanecm) [21:06:06] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:06:37] jan_drewniak: 
all reverted. [21:07:31] urbanecm: thanks, and sorry for all the hassle! [21:07:41] it happens :). good luck with figuring out the issue! [21:08:33] I merged something in the wrong order in puppet. so no puppet run on mwmaint right now. But I'm on it. [21:10:16] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:11:34] (03PS9) 10Dzahn: Add profile::mediawiki::sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:12:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:24] (03CR) 10Dzahn: "unexpected rebase results...certainly don't want to delete all these, heh" [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:12:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:17:10] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.538 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:17:19] (03PS10) 10Dzahn: Add profile::mediawiki::sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:18:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:18:05] (03CR) 10Dzahn: [C: 03+1] "reduced this change to match exactly what it says, add profile::mediawiki::shared_periodic_job but only that" [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:18:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:18:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [21:19:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:22:05] (03CR) 10Dzahn: [C: 03+1] Add profile::mediawiki::sharded_periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:23:12] (03CR) 10Dzahn: [C: 03+2] Add profile::mediawiki::sharded_periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:26:37] (03PS1) 10Brennen Bearnes: scap: separate new rev perms from old rev perm cleanup [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/825911 (https://phabricator.wikimedia.org/T313953) [21:27:13] (03CR) 10Dzahn: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:31:16] PROBLEM - Disk space on ms-be2039 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2039&var-datasource=codfw+prometheus/ops [21:31:46] (03CR) 10Jeena Huneidi: [C: 03+1] scap: separate new rev perms from old rev perm cleanup [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/825911 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [21:33:22] (03PS2) 10Brennen Bearnes: scap: separate new rev perms from old rev perm cleanup [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/825911 (https://phabricator.wikimedia.org/T313953) [21:37:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:45:27] (03CR) 10Dzahn: [C: 03+2] "on mwmaint2002 this created all the timers, 
snippets in /etc/logrotate.d/, etc.. just like with existing mediawiki_job_updatequerypages*" [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [21:47:16] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:51:01] (03CR) 10Brennen Bearnes: "This covers a chunk of what Dan and I talked about yesterday; seems to work in devtools." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/825911 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [22:10:36] (03CR) 10Dzahn: [C: 03+2] "[mwmaint1002:~] $ file /lib/systemd/system/mediawiki_job_update_special_pages_*" [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [22:10:59] (03CR) 10Dzahn: [C: 03+2] "[mwmaint1002:~] $ file /lib/systemd/system/mediawiki_job_update_special_pages_*" [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [22:12:30] PROBLEM - Check for large files in client bucket on mwmaint1002 is CRITICAL: WARNING: large files in client bucket https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [22:16:51] !log dancy@deploy1002 Testing. 
Ignore [22:23:47] so the "large files in client bucket" is funnysad [22:24:02] find /var/lib/puppet/clientbucket -type f -size +100M | while read line ; do cat "$(dirname ${line})"/paths ; done | uniq [22:24:05] /var/log/mediawiki/mediawiki_job_update_special_pages/syslog.log [22:24:08] /var/log/mediawiki/mediawiki_job_update_special_pages/syslog.log.1 [22:24:15] bash: cd: /var/log/mediawiki/mediawiki_job_update_special_pages/: No such file or directory [22:24:26] it's because I merged a change which told puppet to remove a timer [22:24:30] which makes it remove the log dir [22:24:41] which means it had the syslog.log in client bucket [22:25:45] " Puppet is not great at managing large files in general but especially if it needs to disable. " [22:25:49] https://wikitech.wikimedia.org/wiki/Puppet#check_client_bucket_large_file [22:31:27] !log mwmaint1002 - find /var/lib/puppet/clientbucket -type f -size +100M -delete [22:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:36] RECOVERY - Check for large files in client bucket on mwmaint1002 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [22:35:52] RhinosF1: ^ [22:36:33] mutante: yey, my advice didn't earn me a t-shirt! [22:37:05] RhinosF1: we need to do pants. nobody does pants. 
people have 100 IT shirts and one pair of jeans [22:37:45] mutante: nobody sells jeans that fit me [22:40:25] (03PS1) 10Ori: Restart incremental roll-out of query-sorting at 1% [puppet] - 10https://gerrit.wikimedia.org/r/825917 (https://phabricator.wikimedia.org/T314868) [22:43:44] (03CR) 10Ori: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36905/console" [puppet] - 10https://gerrit.wikimedia.org/r/825917 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [22:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:59:48] (03CR) 10RLazarus: Add profile::mediawiki::sharded_periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [23:03:42] (03PS4) 10BCornwall: Varnish: Stop sending analytics cookies to API [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) [23:05:26] (03PS5) 10BCornwall: Varnish: Stop sending analytics cookies to API [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) [23:08:11] (03PS1) 10RLazarus: mediawiki: Rename sharded_periodic_job's `command` param to `script` [puppet] - 10https://gerrit.wikimedia.org/r/825920 [23:08:58] (03CR) 10CI reject: [V: 04-1] mediawiki: Rename sharded_periodic_job's `command` param to `script` [puppet] - 10https://gerrit.wikimedia.org/r/825920 (owner: 10RLazarus) [23:09:34] (03PS2) 10RLazarus: mediawiki: Rename sharded_periodic_job's `command` param to `script` [puppet] - 10https://gerrit.wikimedia.org/r/825920 [23:10:00] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10wiki_willy) a:03Jclark-ctr [23:11:01] (03CR) 10Dzahn: [C: 03+1] "ACK, Lego did say 
that..already on it" [puppet] - 10https://gerrit.wikimedia.org/r/825920 (owner: 10RLazarus) [23:12:13] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36907/console" [puppet] - 10https://gerrit.wikimedia.org/r/825920 (owner: 10RLazarus) [23:12:44] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36906/mwmaint1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/825920 (owner: 10RLazarus) [23:13:10] 10SRE, 10ops-codfw, 10Discovery-Search: elastic2054 is down with memory error - https://phabricator.wikimedia.org/T315989 (10wiki_willy) a:03Papaul [23:14:49] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10wiki_willy) a:03RobH [23:15:46] (03CR) 10Dzahn: [C: 03+2] "noop on mwmaint*" [puppet] - 10https://gerrit.wikimedia.org/r/825920 (owner: 10RLazarus) [23:16:31] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10wiki_willy) Assigning over to Rob, who's currently working on getting the eqsin hardware refresh ordered. 
[23:18:50] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:22:24] (03PS6) 10BCornwall: Varnish: Stop sending analytics cookies to API [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) [23:22:43] (03PS1) 10Dzahn: spamassassin: remove absented cron file [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) [23:24:17] (03CR) 10BCornwall: Varnish: Stop sending analytics cookies to API (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [23:25:10] (03PS1) 10Ebernhardson: query_service: Avoid passing content body to internal auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/825925 (https://phabricator.wikimedia.org/T306899) [23:25:38] (03PS2) 10Ebernhardson: query_service: Avoid passing content body to internal auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/825925 (https://phabricator.wikimedia.org/T306899) [23:25:40] (03CR) 10BCornwall: "Looks like a number of tests are failing now. Was this expected?" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [23:34:25] (03CR) 10Ebernhardson: [C: 03+1] elastic: es7 removed bulk threadpool [puppet] - 10https://gerrit.wikimedia.org/r/825883 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [23:35:56] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T316044 (10Aklapper) If there are some WMDE onboarding docs, then please make these docs point to https://phabricator.wikimedia.org/tag/ldap-access-requests/ - thanks. 
[23:48:40] (03PS1) 10Andrew Bogott: Eqiad designate -> OpenStack version Xena [puppet] - 10https://gerrit.wikimedia.org/r/825927 (https://phabricator.wikimedia.org/T296561)