[00:08:09] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:59] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:20:09] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:22:05] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:25:23] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:31:00] (03PS1) 10Tim Starling: Microsecond timestamp resolution in UDP logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820904 [00:31:39] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:37] (03CR) 10Tim Starling: "I tested it in production by making this change on mwdebug2001 and then sending a request to it with X-Wikimedia-Debug." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/820904 (owner: Tim Starling) [00:39:19] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:05] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:51] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:43] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:12:55] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:24:55] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:27:18] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I tried a warmup request followed by another request for the same page view, the second having MW logging enabled with [[https://gerrit.wikimedia....
[01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:07] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:47] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:09] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:56:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:59:47] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [02:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220807T0700) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T0200) [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:53] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:09:16] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) I tested parse times with `ab -n10 -H'X-Forwarded-Proto: https' -X mw1441.eqiad.wmnet:80 'http://test2.wikipedia.org/w/api.php?action=parse&format... 
[02:10:21] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [02:15:17] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:15] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:46:27] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:35] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:58:25] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:04:51] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:17:35] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:29:35] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:05] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:37] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [03:58:49] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:02:29] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:21:37] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:37] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:52:47] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:58:39] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:04:49] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:50] (03PS2) 10KartikMistry: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829) [05:15:24] (03CR) 10Tim Starling: "So how's it looking? Was the test successful?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/146849 (owner: 10Bsitu) [05:23:59] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:29:08] 10SRE, 10Performance-Team, 10serviceops: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 (10tstarling) [05:34:13] (03PS1) 10Tim Starling: Remove abandoned Echo job queue test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) [05:35:57] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:44:13] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:44:35] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:50:26] (03PS1) 10Tim Starling: Remove testwiki example.org link [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) [05:50:28] (03PS1) 10Tim Starling: Remove wgVectorResponsive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) [05:50:30] (03PS1) 10Tim Starling: Remove override for wgRevisionCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) [05:55:07] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:56:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:00:47] (03PS1) 10Tim Starling: Remove testwiki wgTorTagChanges override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750) [06:00:49] (03PS1) 10Tim Starling: Remove testwiki live preview demo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) [06:07:09] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:26:23] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:59] (03PS1) 10Tim Starling: Remove unnecessary override for wmgUseCLDR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) [06:31:01] (03PS1) 10Tim Starling: Remove wmgDisplayFeedsInSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) [06:31:03] (03PS1) 10Tim Starling: Remove 
wmgUseWikimediaShopLink [mediawiki-config] - https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) [06:36:32] SRE, serviceops: eqiad (2) memcached host for wikifunctions service implementation tracking - https://phabricator.wikimedia.org/T313965 (Joe) [06:38:25] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:15] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:42:32] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe) a:Joe→None [06:42:56] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe) @RobH all info should be filled in now.
[06:43:40] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Joe) a:Joe→None [06:44:05] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Joe) a:Papaul @RobH the task should be complete with all the info, reassigning to Papaul [06:44:29] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe) a:Jclark-ctr [06:45:04] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (Joe) [06:45:30] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (Joe) [06:57:31] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:58:23] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T0700). [07:00:04] kart_ and koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] * kart_ is here. [07:00:24] and will self-deploy..
[07:00:42] (CR) KartikMistry: [C: +2] Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829) (owner: KartikMistry) [07:00:54] hi kart_, would you like to also deploy my patch :) [07:01:04] koi: sure! Let me check. [07:02:58] (Merged) jenkins-bot: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829) (owner: KartikMistry) [07:04:37] (CR) Ayounsi: [V: +1 C: +2] Netbox: add CSP headers [puppet] - https://gerrit.wikimedia.org/r/820645 (https://phabricator.wikimedia.org/T296356) (owner: Ayounsi) [07:05:37] !log restart rsyslog on ml-serve-ctrl2001 [07:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:44] !log add CSP headers to Netbox - T296356 [07:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:09] koi: Deploying my patch. I'll ping when your patch is ready for testing..
[07:09:29] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:09:58] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820261|Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829)]] (duration: 03m 15s) [07:10:00] T308829: Enable Section Translation on 10 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T308829 [07:10:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:10:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:10:59] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:11:14] (03PS2) 10KartikMistry: trwikivoyage: Create rollbacker user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820815 (https://phabricator.wikimedia.org/T314678) (owner: 10Stang) [07:11:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json [07:11:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:11:50] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [07:11:54] !log restart rsyslog on ml-serve2007 [07:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:46] (03CR) 10KartikMistry: [C: 03+2] "UTC Morning Config 
Deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/820815 (https://phabricator.wikimedia.org/T314678) (owner: Stang) [07:14:48] (Merged) jenkins-bot: trwikivoyage: Create rollbacker user group [mediawiki-config] - https://gerrit.wikimedia.org/r/820815 (https://phabricator.wikimedia.org/T314678) (owner: Stang) [07:15:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:16:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:17:05] koi: Please test patch on mwdebug1001 [07:17:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:17:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:18:45] kart_: tested and LGTM [07:18:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:19:09] koi: cool. Deploying..
[07:19:47] (PS1) Kevin Bazira: ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) [07:22:26] (CR) CI reject: [V: -1] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: Kevin Bazira) [07:22:46] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820815|trwikivoyage: Create rollbacker user group (T314678)]] (duration: 03m 17s) [07:22:48] (CR) Ayounsi: [C: +2] Netbox: add hourly postgres backups [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) (owner: Ayounsi) [07:22:49] T314678: Add rollbacker user group to trwikivoyage - https://phabricator.wikimedia.org/T314678 [07:23:23] koi: Done. [07:23:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:23:57] thanks a lot! [07:24:17] I need to go for quick lunch + meeting now. If any other deployers available, config deployment window has still approx 35 minutes left.. [07:25:29] Oh, I've patch, but I forgot to even submit it. Tomorrow maybe!
[07:50:15] (03PS1) 10Ayounsi: Netbox backup: only run on the primary node [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) [07:50:26] !log grow sda/sdb 3 by 100G on thanos-be1004 - T314275 [07:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:29] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 [07:53:39] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:53:46] !log grow sda/sdb 3 by 100G on thanos-be2001 - T314275 [07:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json [07:57:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [07:57:06] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [07:57:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [07:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json [07:57:29] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:08] (03PS1) 10Filippo Giunchedi: install_server: set minimum 200G for swift sd[ab]3 [puppet] - 10https://gerrit.wikimedia.org/r/821174 
(https://phabricator.wikimedia.org/T314275) [08:00:57] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:01:11] (03PS1) 10Ayounsi: Postgres dumps: add hour and minute to filename [puppet] - 10https://gerrit.wikimedia.org/r/821175 (https://phabricator.wikimedia.org/T262677) [08:03:03] (03PS2) 10Ayounsi: Postgres dumps: add hour and minute to filename [puppet] - 10https://gerrit.wikimedia.org/r/821175 (https://phabricator.wikimedia.org/T262677) [08:06:55] (03CR) 10Ayounsi: [C: 03+2] "Self merging as well as it seems low risk." [puppet] - 10https://gerrit.wikimedia.org/r/821175 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [08:09:09] (03CR) 10Filippo Giunchedi: "Thank you! My bad re: syntax" [puppet] - 10https://gerrit.wikimedia.org/r/820800 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [08:09:25] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:52] (03PS1) 10Ayounsi: Netbox: remove CSV dump directory and time [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) [08:25:54] (03PS1) 10Ayounsi: Netbox: remove Puppet config related to CSV dumps [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) [08:26:49] (03CR) 10David Caro: [C: 03+2] wmcs: some yaml autoformatting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816005 (owner: 10David Caro) [08:28:15] (03PS4) 10David Caro: ceph:osd: add support for multi-network setup [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) [08:28:39] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:40] (03PS1) 10Ayounsi: Remove CSV dump scripts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821180 (https://phabricator.wikimedia.org/T310615) [08:32:21] (03PS2) 10Ayounsi: Netbox: remove CSV dump directory and timer [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) [08:34:42] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36636/" [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi) [08:35:05] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:36:02] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36637/" [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi) [08:38:48] (03PS1) 10Jcrespo: Revert "dbbackups: Move s4 eqiad snapshots from db1150 to db1145" [puppet] - 10https://gerrit.wikimedia.org/r/820868 [08:39:03] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:37] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:41:31] (03CR) 10Ladsgroup: [C: 03+1] Revert "dbbackups: Move s4 eqiad snapshots from db1150 to db1145" [puppet] - 10https://gerrit.wikimedia.org/r/820868 (owner: 10Jcrespo) [08:41:35] !log deploy libtirpc update [08:41:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:51] (03CR) 10Ladsgroup: "I think we need to keep db1132 (10.6) as we are doing a lot of experiments on it, some that's making it go down sometimes: T311106. The re" [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) (owner: 10Jcrespo) [08:46:17] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:43] (03CR) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) (owner: 10Jcrespo) [08:48:01] (03PS1) 10David Caro: wmcs: autoformat our yaml files [puppet] - 10https://gerrit.wikimedia.org/r/821181 [08:48:39] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:49:31] (03CR) 10David Caro: [C: 03+2] wmcs: some yaml autoformatting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816005 (owner: 10David Caro) [08:49:37] (03CR) 10David Caro: "From https://gerrit.wikimedia.org/r/c/operations/puppet/+/816005/2#message-3e6bf118fb0f5d1c209fb907494f3c8b3cff88b8" [puppet] - 10https://gerrit.wikimedia.org/r/821181 (owner: 10David Caro) [08:53:13] (03CR) 10AikoChou: [C: 03+1] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [08:54:59] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:57:51] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:01:03] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:08] (03CR) 10Ladsgroup: mariadb: Revert a few leftover disabled notif., belived to be wrong (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) (owner: 10Jcrespo) [09:02:11] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:54] (03CR) 10Krinkle: [C: 03+1] Remove wmgUseWikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) (owner: 10Tim Starling) [09:04:51] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:05:36] (03CR) 10Elukey: "Hey Kevin! The change looks good, but for staging we'd probably need to keep the number of pods low, so probably only enwiki is enough. If" [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [09:09:10] (03CR) 10Krinkle: [C: 03+1] "Potentially related to T219592 as well. 
Either we can remove a bunch of code in Echo, or it's unfinished/abandoned solution to T219592 for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:09:14] (03CR) 10Krinkle: [C: 03+1] Remove testwiki example.org link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:14:11] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:17] 10SRE, 10Data Engineering Planning: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10BTullis) Yes I am still interested. Adding it to our planning board for discussion. [09:18:47] (03PS1) 10Btullis: Replace underscores with hyphens in dse-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/821186 (https://phabricator.wikimedia.org/T313129) [09:19:37] (03CR) 10Jaime Nuche: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [09:19:43] (03CR) 10Jaime Nuche: "Tested in beta" [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [09:20:40] (03PS2) 10Ayounsi: Netbox DB dump, hourly on secondary, daily on primary [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) [09:20:51] (03PS6) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) [09:23:13] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36638/console" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:23:41] (03CR) 10Krinkle: [C: 03+1] Remove wgVectorResponsive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:23:44] (03CR) 10Krinkle: [C: 03+1] Remove override for wgRevisionCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:23:50] (03CR) 10Krinkle: [C: 03+1] Remove testwiki wgTorTagChanges override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:23:58] (03CR) 10Krinkle: [C: 03+1] Remove testwiki live preview demo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:24:07] (03CR) 10Krinkle: [C: 03+1] Remove unnecessary override for wmgUseCLDR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:24:34] (03CR) 10Elukey: [C: 03+1] Replace underscores with hyphens in dse-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/821186 
(https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:24:39] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:25:10] (03PS7) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) [09:25:20] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36639/" [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [09:25:32] (03CR) 10Btullis: [C: 03+2] Replace underscores with hyphens in dse-k8s-etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/821186 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:26:10] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36640/console" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:26:21] (03CR) 10Krinkle: [C: 03+1] "Also matches the FeatureFeeds' extension default. 
I think we mostly try to move declarations to IS.php away from CS.php statements (part o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [09:30:46] (03PS8) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) [09:32:04] (03PS1) 10Ayounsi: Netbox: move db::dump_interval for profile default [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677) [09:33:00] (03PS2) 10Ayounsi: Netbox: move db::dump_interval to profile default [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677) [09:33:21] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:54] (03CR) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [09:38:56] (03CR) 10Jbond: [V: 03+2 C: 03+2] scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond) [09:42:57] (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Move s4 eqiad snapshots from db1150 to db1145" [puppet] - 10https://gerrit.wikimedia.org/r/820868 (owner: 10Jcrespo) [09:49:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/820813 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi) [09:52:26] (03CR) 10Jbond: [C: 03+1] "LGTM assuming the authorisation comes" [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) (owner: 10RhinosF1) [09:54:09] (03PS3) 10RhinosF1: admin: update ssh key for mnz [puppet] - 10https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) [09:54:34] jbond: ty, i 
believe mortiz was doing the checks [09:56:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:57:21] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:05] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Netbox: add housekeeping systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/820813 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi) [09:59:29] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:24] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:14] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:11:37] 10SRE, 10SRE-Access-Requests, 10SecTeam-Processed: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (10larissagaulia) Thank you all [10:15:30] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:17:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom [10:18:09] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom [10:18:58] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:04] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ayounsi) p:05Medium→03High ` Aug 8 06:10:32 mr1-eqiad /kernel: KERN_ARP_ADDR_CHANGE: arp info overwritten for 10.65.2.255 from d0:8e:79:f4:1... [10:23:07] (03CR) 10Jbond: "cr looks good to me but its not clear from the commit why the splay is needed" [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [10:23:57] (03PS3) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) [10:24:06] (03PS1) 10Ladsgroup: mariadb: Decommission db2079 [puppet] - 10https://gerrit.wikimedia.org/r/821198 (https://phabricator.wikimedia.org/T313885) [10:25:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi) [10:25:42] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:25:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet [10:26:02] (03CR) 10Jbond: [C: 03+1] "LGTM, of course will need the previous change to apply before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) (owner: 
10Ayounsi) [10:26:18] (03CR) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) (owner: 10Jcrespo) [10:26:27] (03CR) 10David Caro: [C: 03+2] kiwix: create dest dir before rsyncing if it does not exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [10:26:45] (03PS3) 10Ayounsi: Netbox DB dump, hourly on secondary, daily on primary [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) [10:26:54] (03PS4) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) [10:26:54] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [10:27:26] (03CR) 10Ayounsi: Netbox DB dump, hourly on secondary, daily on primary (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [10:28:54] (03CR) 10Jcrespo: "Amir: this showed as a conflict- do you know where this comes from? Can it be deleted?" [puppet] - 10https://gerrit.wikimedia.org/r/768653 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [10:29:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [10:29:36] (03CR) 10Jcrespo: "Same here." 
[puppet] - 10https://gerrit.wikimedia.org/r/768652 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [10:30:02] (03CR) 10Ayounsi: [C: 03+2] Netbox: move db::dump_interval to profile default [puppet] - 10https://gerrit.wikimedia.org/r/821191 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [10:30:22] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [10:30:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [10:30:43] (03PS9) 10David Caro: prometheus: Add icmp blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/815910 [10:31:43] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) (owner: 10Jcrespo) [10:31:50] (03CR) 10Ayounsi: [C: 03+2] Netbox DB dump, hourly on secondary, daily on primary [puppet] - 10https://gerrit.wikimedia.org/r/821173 (https://phabricator.wikimedia.org/T262677) (owner: 10Ayounsi) [10:34:49] (03CR) 10Jbond: "LGTM, although one wonders if the damage is already done :P" [puppet] - 10https://gerrit.wikimedia.org/r/812343 (owner: 10David Caro) [10:34:50] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:35:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet [10:37:21] (03PS2) 10Ladsgroup: mariadb: Decommission db2079 [puppet] - 10https://gerrit.wikimedia.org/r/821198 (https://phabricator.wikimedia.org/T313885) [10:37:25] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Decommission db2079 [puppet] - 10https://gerrit.wikimedia.org/r/821198 (https://phabricator.wikimedia.org/T313885) (owner: 10Ladsgroup) [10:39:58] !log Removing db2079 from zarcillo 
(T313885) [10:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:01] T313885: decommission db2079 - https://phabricator.wikimedia.org/T313885 [10:43:21] !log Removing db2079 from orchestrator (T313885) [10:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:01] 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Ladsgroup) This host is ready for DC-Ops to decommission [10:45:09] 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Ladsgroup) a:03Papaul [10:46:16] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:37] (03CR) 10Jcrespo: [C: 03+2] mariadb: Revert a few leftover disabled notif., belived to be wrong [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T311106) (owner: 10Jcrespo) [10:49:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, see comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [10:49:31] (03PS3) 10Ayounsi: Netbox: remove CSV dump directory and timer [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) [10:51:54] (03PS2) 10Ayounsi: Netbox: remove Puppet config related to CSV dumps [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) [10:53:35] (03CR) 10Ayounsi: [C: 03+2] Netbox: remove CSV dump directory and timer [puppet] - 10https://gerrit.wikimedia.org/r/821177 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi) [10:57:44] (03CR) 10David Caro: [C: 03+2] gitignore: add note to use global ignore file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812343 (owner: 10David Caro) [10:57:51] (03PS4) 10David Caro: gitignore: add note to use global ignore file [puppet] - 10https://gerrit.wikimedia.org/r/812343 [10:59:25] (03CR) 10Ayounsi: [C: 03+2] Netbox: remove Puppet config related to CSV dumps [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi) [10:59:37] (03PS3) 10Ayounsi: Netbox: remove Puppet config related to CSV dumps [puppet] - 10https://gerrit.wikimedia.org/r/821178 (https://phabricator.wikimedia.org/T310615) [11:01:56] this is a weird warning: "WARN: 0 puppet certs need to be renewed:" [11:02:55] (03CR) 10Abijeet Patro: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [11:03:01] (03PS2) 10Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) [11:08:57] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:55] (03CR) 10Ayounsi: "This looks like a CI issue?" [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [11:10:44] (03CR) 10Ayounsi: "I think this is safe to merge?" [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:12:13] (03CR) 10Ayounsi: [C: 03+2] provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi) [11:13:46] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: remove deprecated functions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792622 (owner: 10Ayounsi) [11:14:27] (03CR) 10Nikerabbit: Enable message bundle on MetaWiki for WikiLearn (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [11:14:39] (03CR) 10Kevin Bazira: ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [11:14:42] (03CR) 10Jcrespo: [C: 04-1] "I think the best solution is to implement 3 separate "modes":" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/820664 (owner: 10Jcrespo) [11:16:45] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:59] (03Merged) 10jenkins-bot: provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi) [11:21:09] !log kubectl uncordon kubernetes2022.codfw.wmnet [11:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:44] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet [11:33:35] (03PS1) 10Urbanecm: Move WEIGHT_* constants to IMentorWeights [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820870 (https://phabricator.wikimedia.org/T314362) [11:34:24] (03PS1) 10Urbanecm: MentorTools: Do not use MentorWeightManager [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820871 (https://phabricator.wikimedia.org/T314362) [11:34:47] (03Abandoned) 10Urbanecm: Move WEIGHT_* constants to IMentorWeights [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820870 (https://phabricator.wikimedia.org/T314362) (owner: 10Urbanecm) [11:36:34] jouncebot: nowandnext [11:36:34] No deployments scheduled for the next 1 hour(s) and 23 minute(s) [11:36:34] In 1 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1300) [11:36:40] (03CR) 10Urbanecm: [C: 03+2] MentorTools: Do not use MentorWeightManager [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820871 (https://phabricator.wikimedia.org/T314362) (owner: 10Urbanecm) [11:43:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet [11:48:40] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:04] RECOVERY - 
Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:29] (03CR) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [11:58:38] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:35] (03Merged) 10jenkins-bot: MentorTools: Do not use MentorWeightManager [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820871 (https://phabricator.wikimedia.org/T314362) (owner: 10Urbanecm) [12:03:44] PROBLEM - Host an-worker1102 is DOWN: PING CRITICAL - Packet loss = 100% [12:04:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:06:14] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: 3eaf155678b7313c55dcca0cd39ab29f73eead37: MentorTools: Do not use MentorWeightManager (T314362) (duration: 03m 31s) [12:06:17] T314362: Ensure MentorWeightManager is not used with structured mentor list - https://phabricator.wikimedia.org/T314362 [12:06:21] * urbanecm done [12:09:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:09:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:11:02] PROBLEM - Check systemd state on mw2393 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:25:36] RECOVERY - Host an-worker1102 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [12:26:22] RECOVERY - Check 
systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet [12:49:42] (03PS1) 10Urbanecm: Growth: Add new rights to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821215 [12:51:10] (03CR) 10Urbanecm: [C: 03+2] Growth: Add new rights to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821215 (owner: 10Urbanecm) [12:52:12] (03Merged) 10jenkins-bot: Growth: Add new rights to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821215 (owner: 10Urbanecm) [12:56:22] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 77fd5abdd7d9462869259e1511bbcf2d7ce62246: Growth: Add new rights to wgAvailableRights (duration: 03m 24s) [12:58:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:59:02] (03PS1) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) [12:59:56] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:00:04] RoanKattouw, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:00:16] * urbanecm waves [13:00:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36641/console" [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [13:01:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:01:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:02:05] 10SRE, 10Infrastructure-Foundations, 10netops: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (10cmooney) @ayounsi @Papaul I've done the first draft of the summary here: https://wikitech.wikimedia.org/wiki/Dell_Enterprise_Sonic_Evaluation Feel fre... [13:03:12] (03CR) 10CI reject: [V: 04-1] wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [13:03:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:08:20] (03PS2) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) [13:10:51] (03PS3) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) [13:12:04] (03PS4) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) [13:12:06] (03CR) 10Cathal Mooney: [C: 03+1] "Makes sense nice work! 
If it were Python I'd suggest manipulating the addresses using the ipaddress library rather than string splitting " [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) (owner: 10David Caro) [13:12:18] (03PS5) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) [13:15:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [13:15:59] 10SRE, 10Infrastructure-Foundations, 10netops: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (10Papaul) @cmooney thanks for putting this together. [13:29:38] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:32:22] (03PS6) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) [13:33:52] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:42:19] (ProbeDown) firing: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:45] hotlink? 
[13:42:49] (03PS1) 10Btullis: Add thirdparty/bigtop15 component to wikimedia-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643) [13:43:02] looking [13:44:10] jbond: yep, looks like https://upload.wikimedia.org/wikipedia/commons/d/db/Neha_Hinge_%2811%29.jpg [13:44:41] XioNoX: ack thanks, do you know what the action taken previously was? create a requestctl rule for this image? [13:44:46] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36642/console" [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [13:45:42] and again it's with referrer https://click-it.me/ [13:45:53] jbond: the spike is over, it will self heal [13:45:53] jbond: that's the best option yeah [13:45:55] heel [13:45:57] or ... just wait [13:45:59] heh [13:46:19] I was right, heal [13:46:33] ack thanks, I'll see if there is anyone in traffic to try and progress the hotlinking patch [13:46:34] we can have a strong rate limit for https://click-it.me/ [13:46:54] jbond: the hotlink patch only applies with an empty referer? [13:47:02] because now it's set to that url [13:47:19] (ProbeDown) resolved: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:23] XioNoX: no, it applies a rate limit to anything that doesn't have a referer with a WMF domain [13:47:30] nice [13:47:38] so yeah that will help [13:47:39] there is some need to handle 'allowed' referrers like with maps [13:47:59] (just out of idle curiosity, how did you find out which file / figure out that it was a load of hotlink traffic?) 
[13:48:05] or temporarily have a strict rate limit for that specific one as we keep seeing it [13:48:17] cdanis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/768723 if interested [13:48:20] TheresNoTime: experience :) [13:48:26] XioNoX: it should be possible to add a per-URL bytes egress limit using haproxy stick-tables [13:48:28] TheresNoTime: but confirmed with NEL data [13:48:46] TheresNoTime: dunno if you have access to https://logstash.wikimedia.org/app/dashboards#/view/ee6432c0-82a9-11eb-9d45-739221ba7fb6 [13:48:54] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:48:54] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:48:58] ahaha we had enough NEL errors for the image that it showed up there? 
I was expecting you to say you looked at centrallog1001 weblog [13:49:14] cdanis: yeah :) [13:49:18] checking those [13:49:21] another thing we could do: redirect to another site dynamically [13:49:24] might be the same with some latency [13:49:25] XioNoX: ah I do, I initially went to logstash but didn't think to look there :) guess that's where the experience comes in :D [13:49:59] the two 'quick' places to look on logstash are varnish5xx and NEL [13:51:21] thank you :) [13:51:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) [13:51:41] jbond: you can ack/ignore the librenms alert it should recover, I'm keeping an eye on it [13:51:50] ack thanks [13:52:00] XioNoX: I'll add writing up a brief proposal about haproxy stick-tables to my list this week [13:52:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) >>! In T314676#8135321, @Aklapper wrote: > Note that the Phabricator account @VirginiaPoundstone is linked to [a self-created, non-WMF SUL wik... 
[13:52:21] varnish can't easily do throttling by bytes egress, but haproxy can [13:53:33] cdanis: could be worth having a "NELs by url" visualisation on the NEL dashboard for those use cases too [13:53:40] indeed [13:53:49] +1 [13:53:54] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:53:55] XioNoX: please feel free to edit the dashboard or file a task ;) [13:54:10] cdanis: sorry I can't hear you, you're too far away [13:54:29] cdanis: cool for the stick-tables, I'll need a tldr, the task is becoming huge :) [13:54:54] ahahah [13:55:46] p [13:55:53] okay I made some notes to myself, off to a meeting now [13:55:57] thanks jbond XioNoX [13:56:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:59:10] np and cheers cdanis, XioNoX :) [14:00:09] cdanis: I added it ;) [14:01:34] (03CR) 10Elukey: [C: 03+2] ml-services: Add arwiki, cswiki & enwiki drafttopic isvcs to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [14:01:54] (03CR) 10Elukey: [C: 03+2] "Ok let's see how it goes! 
If needed we'll prune some isvcs in the future :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/821168 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [14:06:18] (03CR) 10David Caro: [C: 03+2] ceph:osd: add support for multi-network setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816006 (https://phabricator.wikimedia.org/T309209) (owner: 10David Caro) [14:09:21] (03CR) 10Filippo Giunchedi: [C: 03+1] Add puppet profile and role files for WikiFunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [14:10:53] (03CR) 10Ori: [C: 03+2] Add puppet profile and role files for WikiFunctions. [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [14:11:01] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36643/console" [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [14:11:02] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[14:12:29] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Jdforrester-WMF) a:05Jdforrester-WMF→03None [14:17:54] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:20:48] (03PS1) 10Samtar: logos/manage.py: Use shortened link in user agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821246 [14:20:50] (03PS1) 10Elukey: ml-services: update editquality's Docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821247 (https://phabricator.wikimedia.org/T301878) [14:20:52] (03PS1) 10Elukey: ml-services: test the new Docker image for articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/821248 (https://phabricator.wikimedia.org/T301878) [14:22:31] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF) [14:26:08] (03CR) 10Jbond: [C: 03+2] Add thirdparty/bigtop15 component to wikimedia-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/821223 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [14:26:56] (03CR) 10Elukey: [C: 03+2] ml-services: update editquality's Docker image and settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/821247 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [14:29:06] (03CR) 10Elukey: [C: 03+2] ml-services: test the new Docker image for articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/821248 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [14:33:22] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [14:33:22] (03CR) 10JHathaway: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [14:34:27] 10SRE, 10SRE-Access-Requests, 10SecTeam-Processed: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (10sbassett) 05In progress→03Resolved [14:34:49] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:41:40] (03CR) 10Jbond: [C: 03+1] Remove CSV dump scripts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821180 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi) [14:41:54] (03CR) 10Ahmon Dancy: wmflib: fix ipresolve AAAA string representation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [14:42:05] (03CR) 10Jbond: [C: 03+1] Bump pynetbox to ~= 6.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [14:42:17] (03CR) 10Jbond: [C: 03+1] Bump pynetbox to ~= 6.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/820808 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [14:45:25] (03PS7) 10Jbond: wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) [14:45:36] (03CR) 10Jbond: wmflib: fix ipresolve AAAA string representation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [14:45:50] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheresNoTime) [14:46:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256 
[14:46:47] (03PS2) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [14:46:47] T314256: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 [14:46:49] (03PS1) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 [14:47:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256 [14:49:49] (03PS2) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 [14:50:41] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::blackbox::http: add/edit parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [14:51:41] (03CR) 10CI reject: [V: 04-1] Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [14:52:30] (03CR) 10CI reject: [V: 04-1] homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [14:53:04] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:53:27] (03PS1) 10Giuseppe Lavagetto: role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255 [14:55:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:55:47] (03CR) 10Ori: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [14:56:07] (03CR) 10CI reject: [V: 04-1] homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [14:56:56] (03CR) 10CI reject: [V: 04-1] role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255 (owner: 10Giuseppe Lavagetto) [14:56:58] (03CR) 10Ahmon Dancy: [C: 03+1] "Looks reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [14:58:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "This LGTM, (non blocking, hence the +1) please note that that address *might* receive alerts from non-production alertmanager deployments " [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [14:59:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:01:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:02:12] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10TheresNoTime) [15:02:24] (03PS1) 10Jbond: ganeti-netbox-sync: just use the default CA buyndle [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257 [15:03:40] (03CR) 10Ori: alertmanager: route abstract-wikipedia-critical alert e-mails to Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [15:04:19] (03PS3) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 
10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [15:06:10] (03PS4) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [15:06:23] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: route abstract-wikipedia-critical alert e-mails to Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [15:06:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:08:18] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Failed disk in ms-be1066 - https://phabricator.wikimedia.org/T314143 (10Cmjohnson) Case opened, You have successfully submitted request SR148431542. [15:09:16] (03PS3) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 [15:09:30] (03CR) 10Ayounsi: [C: 03+1] "Manually tested and works fine." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257 (owner: 10Jbond) [15:09:56] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10herron) >>! In T313603#8126164, @CDanis wrote:... [15:10:06] 10SRE, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Cmjohnson) [15:10:15] (03CR) 10Ayounsi: [C: 03+1] "But we need to deploy I482631ebf972e755cd9ef1f11175854c0581bcae first if not already done." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257 (owner: 10Jbond) [15:10:46] (03PS5) 10Jbond: Bump pynetbox to ~= 6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [15:10:57] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Cmjohnson) 05Open→03Resolved swapped cable [15:11:14] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10MusikAnimal) It's worth mentioning that, like [[ https://www.mediawiki.o... [15:12:01] (03CR) 10Jbond: [C: 03+2] wmflib: fix ipresolve AAAA string representation [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [15:12:25] (03CR) 10Ahmon Dancy: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/821216 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [15:14:50] (03CR) 10Ayounsi: [C: 03+2] Remove CSV dump scripts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821180 (https://phabricator.wikimedia.org/T310615) (owner: 10Ayounsi) [15:15:30] (03CR) 10Ori: alertmanager: route abstract-wikipedia-critical alert e-mails to Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [15:16:15] (03PS2) 10Ori: alertmanager: route abstract-wikipedia-critical alert e-mails to Slack [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) [15:17:39] (03PS1) 10Andrew Bogott: trove-guestagent.conf: standardize rabbitmq config [puppet] - 10https://gerrit.wikimedia.org/r/821261 (https://phabricator.wikimedia.org/T314522) [15:19:31] (03PS1) 10Elukey: ml-services: add environment variables to editquality pods/isvcs 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/821263 (https://phabricator.wikimedia.org/T301878) [15:20:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:35] (03PS4) 10Jbond: homer: add pyproject.toml [software/homer] - 10https://gerrit.wikimedia.org/r/821254 [15:21:26] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36644/console" [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [15:22:11] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/pcc-worker1001/36644/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [15:22:54] (03CR) 10Ori: [C: 03+2] alertmanager: route abstract-wikipedia-critical alert e-mails to Slack [puppet] - 10https://gerrit.wikimedia.org/r/821256 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [15:25:14] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:21] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [15:26:13] (03CR) 10Ahmon Dancy: P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 
10Jbond) [15:27:31] (03PS5) 10Jbond: netbox: increase TTL to 1D [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) [15:27:47] (03CR) 10Jbond: [C: 03+2] netbox: increase TTL to 1D [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:27:54] (03CR) 10Jbond: [C: 03+2] "thanks" [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:28:32] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1530). [15:30:22] PROBLEM - Host es2021 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:23] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10BBlack) The existing google IP apparently doesn't even have TLS (just old port 80), so it defaults to an insecure site warning in Chrome. Google's public reso... 
[15:32:02] PROBLEM - MariaDB Replica IO: es4 on es2022 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:32:18] PROBLEM - MariaDB Replica IO: es4 on es2020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:32:38] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye [15:33:46] RECOVERY - Host es2021 is UP: PING OK - Packet loss = 0%, RTA = 31.71 ms [15:35:11] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36645/console" [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [15:35:43] PROBLEM - MariaDB read only es4 #page on es2021 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:36:03] I'm around [15:36:08] gonna downtime it [15:36:16] acked [15:36:22] PROBLEM - MariaDB Replica SQL: es4 on es2021 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:36:34] planned? 
[15:36:46] around as well [15:36:56] to my knowledge [15:37:20] PROBLEM - mysqld processes on es2021 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:37:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint [15:37:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint [15:37:58] ok, will resolve [15:39:39] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Bump pynetbox to ~= 6.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/820808 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [15:41:12] PROBLEM - MariaDB Replica Lag: es4 on es2020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 747.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:45:19] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage [15:46:10] !log upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes [15:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:23] what was the issue? [15:47:58] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage [15:49:22] 10SRE, 10ops-codfw, 10serviceops: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10Papaul) [15:49:30] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:49:33] I didn't get paged through VO, did anyone? 
[15:49:54] you shouldn't have, if everything works as expected [15:50:03] oh right it's working hours :) thanks [15:50:11] 10SRE, 10ops-codfw, 10serviceops: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10Papaul) 05Open→03Resolved complete [15:51:18] 10SRE, 10ops-codfw, 10DC-Ops: db2135 (C6) lost power supply redundancy - https://phabricator.wikimedia.org/T314628 (10Papaul) 05Open→03Resolved This is complete [15:53:57] however i didn't get paged either [15:54:00] PROBLEM - MariaDB Replica Lag: es4 on es2022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1516.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:54:38] jbond: maybe your schedule had finished already? [15:54:44] let me see [15:54:51] it shouldn't finish for another 6 minutes [15:54:53] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/820777 (owner: 10Faidon Liambotis) [15:54:57] thanks jynus [15:55:26] I got the page FWIW [15:55:44] RECOVERY - IPMI Sensor Status on es2021 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:55:51] could be from the changes this morning, i may have got put on an earlier shift [15:55:58] I see it already finished [15:56:18] note it is on Cathal's name [15:57:32] I wonder if you edited/checked your batphone schedule, not the emea pool 2 [15:58:01] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [15:58:05] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [15:58:08] jynus: i have not edited anything [15:58:17] the ui confuses me too much :) [15:58:27] i do see here though 
https://portal.victorops.com/dash/wikimedia#/team/team-ra3ayi0mHc3Nr6qu/on-call-schedule that I'm no longer mentioned [15:58:30] for today [15:58:38] jbond: when you start your week you are supposed to edit it to adjust to your preferred schedule [15:58:45] as per manual [15:59:02] jynus: cdanis: has added a bot that should do that automatically [15:59:08] ah [15:59:12] cdanis: please correct me if I'm wrong [15:59:18] that part I didn't know [15:59:27] i think it got added last week at some point [15:59:34] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Michael) [15:59:34] but [15:59:43] jbond: no, that's for not needing to edit "immediate" vs "5 minutes" when the business hours rotation is in effect [15:59:46] for escalation to batphone [15:59:49] I can confirm your schedule finished already [16:00:01] at the start of the week, the oncallers are still supposed to edit the business hours rotation to their preferred hours [16:00:17] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Michael) [16:00:17] sorry it is confusing, you are not alone :-) [16:00:20] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye [16:00:24] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Michael) [16:00:31] perhaps I'll go over it briefly in the meeting [16:00:36] cdanis: ahh yes that bit i did with leo (or should i say leo did for me) [16:00:47] however i think there was some issue that jynus fixed for me this morning [16:01:04] hopefully it will get easier with time + improvements [16:01:09] ah [16:01:10] * jbond hopes [16:01:23] but I just touched the override for cathal, not the actual schedule [16:01:35] it could have been reset, though [16:02:38] in any case, please adjust the hours to the ones you prefer 
now :-D [16:02:50] jynus: yes will do [16:03:17] cdanis: there was an issue with the automation, I think, not sure if you saw scrollback [16:03:25] during the weekend [16:03:40] jynus: the issue was that victorops was erroneously configured [16:03:47] yeah [16:03:48] (this weekend) [16:03:53] not an issue with the automation ;) [16:03:53] don't know the details, sorry [16:04:15] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye [16:04:19] yeah, sorry I didn't mean automation, but something in the procedure or something [16:04:38] I don't know the details, joe was more involved on that part [16:05:33] (03CR) 10Elukey: [C: 03+2] ml-services: add environment variables to editquality pods/isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/821263 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [16:09:24] 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10ayounsi) FYI, I made this dashboard a while ago: https://logstash.wikimedia.org/app/dashboards#/view/AWm67Kpk8aQffZ3HmRpW hopefully it ca... [16:09:41] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [16:09:45] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:10:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[16:10:41] 10SRE, 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10Papaul) 05Open→03Resolved This is complete [16:10:47] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Papaul) [16:11:08] RECOVERY - MariaDB Replica SQL: es4 on es2021 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:12:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [16:12:58] RECOVERY - MariaDB Replica IO: es4 on es2022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:13:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:14:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[16:14:35] RECOVERY - MariaDB read only es4 #page on es2021 is OK: Version 10.4.25-MariaDB-log, Uptime 393s, read_only: True, event_scheduler: True, 29.45 QPS, connection latency: 0.003711s, query latency: 0.000540s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:16:17] 10SRE, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314607 (10Papaul) 05Open→03Declined There is already a task for this @ T314509 [16:16:18] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:29] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [16:16:32] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:16:39] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage [16:16:50] RECOVERY - MariaDB Replica IO: es4 on es2020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:17:00] (03CR) 10Tacsipacsi: "Why is this far away from other Translate settings ($wmgUseTranslate, $wmgTranslateWorkflowStates etc.)?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [16:18:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:19:22] RECOVERY - mysqld processes on es2021 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:19:36] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage [16:20:05] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Papaul) [16:20:26] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Papaul) 05Open→03Resolved complete [16:24:58] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye [16:25:20] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Papaul) [16:25:59] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Papaul) 05Open→03Resolved complete [16:26:09] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye [16:26:10] (03PS1) 10Hnowlan: Create basic haproxy container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) [16:27:55] RECOVERY - MariaDB Replica Lag: es4 on es2020 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:30:42] RECOVERY - SSH on db1110.mgmt 
is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:32:08] (03PS2) 10Hnowlan: Create basic haproxy container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) [16:33:31] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) @fgiunchedi can you please take a look at this alert i see only Smart Storage Battery failed and no disk failed. Thanks [16:33:45] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10Papaul) @fgiunchedi can you please take a look at this alert i see only Smart Storage Battery failed and no disk failed. Thanks [16:33:51] 10SRE, 10Community-Tech, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops: SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10Legoktm) I don't fully understand how FSFileBackend will work here, as t... 
[16:36:00] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Papaul) [16:36:03] (03CR) 10Jbond: [C: 03+2] ganeti-netbox-sync: just use the default CA buyndle [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/821257 (owner: 10Jbond) [16:36:43] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Papaul) 05Open→03Resolved complete [16:38:01] (03CR) 10Btullis: [V: 03+1 C: 03+2] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [16:38:01] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) [16:38:10] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage [16:38:52] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:05] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:40:16] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:22] (03CR) 10David Caro: "Hmm, by the logs of the failure it's the flake8 test not prospector." 
[software/homer] - 10https://gerrit.wikimedia.org/r/821254 (owner: 10Jbond) [16:41:16] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:41:39] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage [16:42:53] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Papaul) [16:43:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:21] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2079 - https://phabricator.wikimedia.org/T313885 (10Papaul) 05Open→03Resolved complete [16:49:49] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [16:49:54] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:51:37] (03PS1) 10Btullis: Fix the spark3 profile [puppet] - 10https://gerrit.wikimedia.org/r/821278 (https://phabricator.wikimedia.org/T295072) [16:52:17] 10SRE, 10DynamicPageList (Wikimedia), 10serviceops, 10Patch-For-Review, and 7 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Krinkle) [16:52:59] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36646/console" [puppet] - 10https://gerrit.wikimedia.org/r/821278 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis) [16:53:47] (03PS7) 10Ayounsi: 
sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 [16:53:49] (03PS9) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 [16:54:02] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye [16:54:48] (03PS8) 10Ayounsi: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 [16:54:57] (03CR) 10Ayounsi: sre.network.debug: initial commit (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 (owner: 10Ayounsi) [16:55:18] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the spark3 profile [puppet] - 10https://gerrit.wikimedia.org/r/821278 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis) [16:56:18] (03PS1) 10Ahmon Dancy: DevServices.php: Add placeholder for image-suggestion service [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/821279 [16:57:56] (03CR) 10Ahmon Dancy: [C: 03+2] DevServices.php: Add placeholder for image-suggestion service [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/821279 (owner: 10Ahmon Dancy) [16:58:51] (03CR) 10CI reject: [V: 04-1] sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [16:59:03] (03Merged) 10jenkins-bot: DevServices.php: Add placeholder for image-suggestion service [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/821279 (owner: 10Ahmon Dancy) [16:59:48] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Papaul) [17:00:02] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Papaul) 05Open→03Resolved complete [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T1700). [17:00:28] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye [17:01:54] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:27] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) @LSobanski hello any update on this? Thanks [17:03:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Reedy) [17:04:22] RECOVERY - MariaDB Replica Lag: es4 on es2022 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:09:58] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:51] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage [17:15:46] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage [17:34:12] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye [17:37:39] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10KFrancis) @Dzahn I am confirming the NDA has been signed. Please proceed with the access request. Thanks! 
[17:58:36] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:59:23] (03PS1) 10CDanis: Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) [17:59:29] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:01:21] (03PS2) 10CDanis: Add a check_esc_policy_config subcommand [software/klaxon] - 10https://gerrit.wikimedia.org/r/821287 (https://phabricator.wikimedia.org/T313603) [18:20:01] (03PS1) 10Jdlrobson: Fix grid blowout bug [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821243 (https://phabricator.wikimedia.org/T314756) [18:23:57] (03PS2) 10Clare Ming: Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) [18:40:23] (03PS1) 10Ottomata: Don't hardcode /opt/conda-analytics in spark3.env.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/821293 (https://phabricator.wikimedia.org/T312882) [18:41:28] (03PS3) 10CDanis: haproxy: properly track client concurrency, & more [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) [18:43:04] (03PS3) 10Ori: abstract-wikipedia alert: increase timeout; correct team name [puppet] - 10https://gerrit.wikimedia.org/r/821294 (https://phabricator.wikimedia.org/T311457) [18:43:13] (03CR) 10Mary Yang: [C: 03+1] abstract-wikipedia alert: increase timeout; correct team name [puppet] - 10https://gerrit.wikimedia.org/r/821294 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [18:45:31] (03PS4) 10CDanis: haproxy: properly track client concurrency, & more [puppet] - 
10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) [18:45:44] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [18:46:42] (03CR) 10Ori: [C: 03+2] abstract-wikipedia alert: increase timeout; correct team name [puppet] - 10https://gerrit.wikimedia.org/r/821294 (https://phabricator.wikimedia.org/T311457) (owner: 10Ori) [18:51:16] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:52:22] (03CR) 10CDanis: [C: 03+2] "Valentin: I'm going to merge this now so we can start gathering correct stats as quickly as possible, but I'm very happy to take comments " [puppet] - 10https://gerrit.wikimedia.org/r/812425 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [18:56:07] (03PS1) 10Ottomata: Fix sudo rules for airflow platform eng admins [puppet] - 10https://gerrit.wikimedia.org/r/821296 (https://phabricator.wikimedia.org/T313727) [18:58:18] (03CR) 10Andrew Bogott: [C: 03+2] trove-guestagent.conf: standardize rabbitmq config [puppet] - 10https://gerrit.wikimedia.org/r/821261 (https://phabricator.wikimedia.org/T314522) (owner: 10Andrew Bogott) [19:01:50] (03Abandoned) 10Andrew Bogott: nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/820758 (owner: 10Andrew Bogott) [19:03:55] (03CR) 10Ottomata: [C: 03+2] Fix sudo rules for airflow platform eng admins [puppet] - 10https://gerrit.wikimedia.org/r/821296 (https://phabricator.wikimedia.org/T313727) (owner: 10Ottomata) [19:12:43] (03PS1) 10Andrew Bogott: nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/821297 [19:12:45] (03PS1) 10Andrew Bogott: openstack::cinder: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268) [19:14:38] 10SRE, 
10Performance-Team, 10serviceops, 10Patch-For-Review: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 (10Krinkle) p:05Triage→03Low a:03tstarling [19:14:46] (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/821297 (owner: 10Andrew Bogott) [19:15:08] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:12] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:10] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:25:58] (03PS1) 10CDanis: haproxy: fix excess_concurrency/would_drop debug logging [puppet] - 10https://gerrit.wikimedia.org/r/821300 (https://phabricator.wikimedia.org/T306580) [19:26:46] (03CR) 10CDanis: [C: 03+2] "same proviso as previous patch ツ" [puppet] - 10https://gerrit.wikimedia.org/r/821300 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [19:28:13] (03CR) 10CDanis: [C: 03+2] "PCC LGTM https://puppet-compiler.wmflabs.org/pcc-worker1001/36648/cp2027.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/821300 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [19:28:44] andrewbogott: puppet-merging your patch as well [19:28:53] thanks! 
[19:34:05] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) [19:34:08] (03PS1) 10CDanis: haproxy: bump concurrency threshold [puppet] - 10https://gerrit.wikimedia.org/r/821301 (https://phabricator.wikimedia.org/T306580) [19:34:48] (03CR) 10CDanis: [C: 03+2] haproxy: bump concurrency threshold [puppet] - 10https://gerrit.wikimedia.org/r/821301 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [19:43:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew) Thanks for chiming in @ayounsi. Now that things are not urgently broken I have time to engage with your questions :) > Given the... [20:00:05] RoanKattouw, Urbanecm, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220808T2000). [20:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:06] 10Puppet, 10Infrastructure-Foundations, 10netbox, 10PostgreSQL: Puppet change at each run on postgres replicas - https://phabricator.wikimedia.org/T311156 (10ayounsi) 05Open→03Resolved This seems to be fixed based on puppetboard. 
[20:02:08] i am the only one on the list so i will deploy [20:04:20] (03CR) 10Clare Ming: [C: 03+2] Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) (owner: 10Clare Ming) [20:04:44] (03CR) 10Clare Ming: [C: 03+2] Fix grid blowout bug [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821243 (https://phabricator.wikimedia.org/T314756) (owner: 10Jdlrobson) [20:05:09] (03Merged) 10jenkins-bot: Disable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817785 (https://phabricator.wikimedia.org/T312296) (owner: 10Clare Ming) [20:07:24] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @TAndic from reading their docs, I think they only support sending from a subdomain of wikimedia.org: > Attention: Adding a custom FROM d... 
[20:08:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:11:24] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817785|Disable sticky header edit A/B test for pilot wikis (T312296)]] (duration: 03m 35s) [20:11:27] T312296: Disable sticky header edit button A/B test for pilot wikis - https://phabricator.wikimedia.org/T312296 [20:11:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:56] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) a:03jhathaway [20:12:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:20:49] (03Merged) 10jenkins-bot: Fix grid blowout bug [skins/Vector] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/821243 (https://phabricator.wikimedia.org/T314756) (owner: 10Jdlrobson) [20:27:34] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: [[gerrit:821243|Fix grid blowout bug (T314756)]] (duration: 03m 26s) [20:27:37] T314756: Grid blowout on various pages with long elements - https://phabricator.wikimedia.org/T314756 [20:27:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:28:03] !log end of UTC late backport window [20:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:21] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [20:29:24] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - 
https://phabricator.wikimedia.org/T289135 [20:31:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:36:12] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye [20:36:20] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1062.eqiad.wmnet with OS bullseye [20:41:36] (03PS1) 10Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) [20:50:10] (03PS1) 10Andrew Bogott: Openstack::nova and ::neutron: use service names for rabbit nodes [puppet] - 10https://gerrit.wikimedia.org/r/821311 (https://phabricator.wikimedia.org/T314522) [20:50:50] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage [21:33:44] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:34:36] 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) a:05Joe→03BCornwall [21:34:51] 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) [21:36:11] 10SRE, 10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) p:05Triage→03Medium [21:36:13] 10SRE, 
10LDAP-Access-Requests: Requesting access to wmf and ops for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) @akosiaris Can you sign off with your approval that this user is indeed the one to grant access? [21:36:25] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye [21:36:32] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1065.eqiad.wmnet with OS bullseye [21:38:20] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10BCornwall) p:05Triage→03Medium [21:43:58] 10SRE, 10Acme-chief, 10Patch-For-Review: acme-chief is down: ValueError: OCSP response status is not successful so the property has no value - https://phabricator.wikimedia.org/T282490 (10BCornwall) 05Open→03Resolved a:03BCornwall @Dzahn I'm assuming you meant 0.3, which has long since been deployed. I... 
[21:45:03] (03PS1) 10Stang: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 [21:45:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10BCornwall) p:05Triage→03Medium [21:45:15] (03PS2) 10Stang: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 [21:50:25] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: 10MdsShakil) [21:50:34] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage [21:53:27] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage [21:59:26] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) @jhathaway Good catch! I've set up one for **surveys.wikimedia.org** and updated the configuration in the sheet I shared, starting on line 11... [21:59:31] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:04:42] (03PS2) 10Clare Ming: Enable sticky header edit A/B test for idwiki + viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821310 (https://phabricator.wikimedia.org/T312295) [22:10:47] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) >>! In T314815#8139092, @TAndic wrote: > @jhathaway Good catch! 
I've set up one for **surveys.wikimedia.org** and updated the configuratio... [22:16:39] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye [22:16:45] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1065.eqiad.wmnet with OS bullseye completed: - elastic1062 (... [22:16:47] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [22:16:49] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1065.eqiad.wmnet with OS bullseye executed with errors: - el... [22:16:50] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [22:18:22] (03PS1) 10Clare Ming: Enable sticky header edit test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821319 (https://phabricator.wikimedia.org/T312573) [22:38:26] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10TAndic) For the difference, I'm specifically looking at Step 13 of https://www.qualtrics.com/support/survey-platform/distributions-module/email-distr... 
[22:44:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:58:07] 10SRE, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Aklapper) What is the relation of this task to `T214201`? Does this one block the other one (=subtask)? [23:06:37] (03Abandoned) 10BryanDavis: rabbitmq: Fix SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/816001 (https://phabricator.wikimedia.org/T308013) (owner: 10BryanDavis) [23:28:58] (03PS2) 10Tim Starling: Remove abandoned Echo job queue test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) [23:29:00] (03PS2) 10Tim Starling: Remove testwiki example.org link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) [23:29:02] (03PS2) 10Tim Starling: Remove wgVectorResponsive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) [23:29:04] (03PS2) 10Tim Starling: Remove override for wgRevisionCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) [23:29:06] (03PS2) 10Tim Starling: Remove testwiki wgTorTagChanges override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821045 
(https://phabricator.wikimedia.org/T314750) [23:29:08] (03PS2) 10Tim Starling: Remove testwiki live preview demo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) [23:29:10] (03PS2) 10Tim Starling: Remove unnecessary override for wmgUseCLDR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) [23:29:12] (03PS2) 10Tim Starling: Remove wmgDisplayFeedsInSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) [23:29:14] (03PS2) 10Tim Starling: Remove wmgUseWikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) [23:31:41] (03PS1) 10Cwhite: logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) [23:31:45] (03PS1) 10Cwhite: logstash: reduce webrequest retention to 31 days [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T314139) [23:32:42] (03PS2) 10Cwhite: logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) [23:33:01] (03PS3) 10Cwhite: logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) [23:33:44] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:36:00] (03CR) 10Tim Starling: [C: 03+2] Remove wmgUseWikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) (owner: 10Tim Starling) [23:36:05] (03CR) 10Tim Starling: [C: 03+2] Remove wmgDisplayFeedsInSidebar [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:36:14] (03CR) 10Tim Starling: [C: 03+2] Remove unnecessary override for wmgUseCLDR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:36:23] (03CR) 10Tim Starling: [C: 03+2] Remove testwiki live preview demo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:36:33] (03CR) 10Tim Starling: [C: 03+2] Remove testwiki wgTorTagChanges override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:36:42] (03CR) 10Tim Starling: [C: 03+2] Remove override for wgRevisionCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:36:53] (03CR) 10Tim Starling: [C: 03+2] Remove wgVectorResponsive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:37:07] (03CR) 10Tim Starling: [C: 03+2] Remove testwiki example.org link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:37:19] (03CR) 10Tim Starling: [C: 03+2] Remove abandoned Echo job queue test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:38:23] (03Merged) 10jenkins-bot: Remove abandoned Echo job queue test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821041 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:38:26] (03Merged) 10jenkins-bot: Remove testwiki example.org link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821042 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:38:29] (03Merged) 
10jenkins-bot: Remove wgVectorResponsive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821043 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:38:33] (03Merged) 10jenkins-bot: Remove override for wgRevisionCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821044 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:38:42] (03Merged) 10jenkins-bot: Remove testwiki wgTorTagChanges override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821045 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:38:44] (03Merged) 10jenkins-bot: Remove testwiki live preview demo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821046 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:38:47] (03Merged) 10jenkins-bot: Remove unnecessary override for wmgUseCLDR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821047 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:39:15] (03Merged) 10jenkins-bot: Remove wmgDisplayFeedsInSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821048 (https://phabricator.wikimedia.org/T314750) (owner: 10Tim Starling) [23:39:23] (03Merged) 10jenkins-bot: Remove wmgUseWikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821049 (https://phabricator.wikimedia.org/T310365) (owner: 10Tim Starling) [23:45:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:46:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:46:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:46:49] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments T314750 (duration: 03m 27s) [23:46:53] T314750: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 [23:47:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [23:52:20] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments T314750 (duration: 03m 19s) [23:52:24] T314750: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 [23:53:06] 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review: Clean up testwiki experiments - https://phabricator.wikimedia.org/T314750 (10tstarling) 05Open→03Resolved [23:53:14] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [23:59:48] (03PS2) 10Andrew Bogott: openstack::nova: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268)