[00:05:37] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:12:57] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10serviceops: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (10lmata) Thank you for the update this will help for backlog priorities setting. [00:24:37] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:26:33] RECOVERY - SSH on ms-fe2008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:32:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [00:39:43] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:42:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [00:47:37] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:39] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:25] PROBLEM - 
Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:55:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:35:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [01:40:59] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:45:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [02:07:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.18 [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754606 [02:07:35] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.18 [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754606 (owner: 10TrainBranchBot) [02:13:03] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10tstarling) > 52 seconds in Shellbox\Client::computeHmac over 3 calls, I guess all signatures for the remote shellbox calls I benchmarked the SHA-256 HMAC... 
[02:23:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:59] (03CR) 10jenkins-bot: [V: 04-1] Branch commit for wmf/1.38.0-wmf.18 [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754606 (owner: 10TrainBranchBot) [02:26:39] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.18 [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754606 (owner: 10TrainBranchBot) [02:26:57] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:29:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:29:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [02:36:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - 
https://alerts.wikimedia.org [02:48:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:48:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:08:39] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:09:47] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:13:29] 10SRE, 10Analytics, 10Data-Engineering, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Ottomata) 05Open→03Resolved a:03Ottomata Yup should be! 
[03:28:55] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [03:39:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [04:11:01] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:12:04] (03PS1) 104nn1l2: commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754612 (https://phabricator.wikimedia.org/T299247) [04:29:33] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:30:11] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:35:07] (03PS1) 104nn1l2: azwiki: Add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754613 (https://phabricator.wikimedia.org/T299332) [04:37:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [04:42:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - 
https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [05:32:05] (03PS1) 10KartikMistry: Update apertium to 2022-01-18-052631-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/754614 (https://phabricator.wikimedia.org/T218184) [05:35:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [05:41:28] * kart_ deploying Apertium service.. [05:42:25] (03CR) 10KartikMistry: [C: 03+2] Update apertium to 2022-01-18-052631-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/754614 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [05:45:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [05:45:55] (03Merged) 10jenkins-bot: Update apertium to 2022-01-18-052631-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/754614 (https://phabricator.wikimedia.org/T218184) (owner: 10KartikMistry) [05:46:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist group from s3 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18764 and previous config saved to /var/cache/conftool/dbconfig/20220118-054659-marostegui.json [05:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:04] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [05:48:04] (03PS1) 10Marostegui: Revert "dbproxy1017: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/754590 [05:49:02] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply on staging [05:49:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:04] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply on production [05:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:10] (03PS1) 10Marostegui: Revert "dbproxy1015: Reimage to Bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/754591 [05:49:36] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: sync on staging [05:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:50] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1017: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/754590 (owner: 10Marostegui) [05:49:59] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1015: Reimage to Bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/754591 (owner: 10Marostegui) [05:51:40] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply on production [05:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:43] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply on staging [05:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:20] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: sync on production [05:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:01] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply on production [05:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:04] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply on staging [05:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:38] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply on production [05:54:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:40] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply on staging [05:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:01] Uh oh. Too much logging? [05:56:19] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: sync on production [05:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:36] (03PS1) 10Marostegui: pc1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/754784 (https://phabricator.wikimedia.org/T299046) [05:57:56] Looks OK then! Apertium service is on Bullseye now! [05:58:51] !log Update apertium to 2022-01-18-052631-production (T218184, T202276, T218184, T270061, T248653, T248293, T248812, T248654) [05:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:02] T248293: Update apertium-af-nl package - https://phabricator.wikimedia.org/T248293 [05:59:03] T202276: Package apertium-pol-szl (Polish-Silesian) - https://phabricator.wikimedia.org/T202276 [05:59:03] T248812: Update apertium-bel-rus package - https://phabricator.wikimedia.org/T248812 [05:59:03] T270061: Update apertium-ita-srd (Italian-Sardinian) - https://phabricator.wikimedia.org/T270061 [05:59:03] T218184: Update apertium-nno-nob, apertium-swe-dan, apertium-swe-nor and apertium-dan-nor packages - https://phabricator.wikimedia.org/T218184 [05:59:04] T248653: Update apertium-id-ms to 0.1.2 - https://phabricator.wikimedia.org/T248653 [05:59:04] T248654: Update apertium-ca-it to 0.2.1 - https://phabricator.wikimedia.org/T248654 [05:59:43] (03CR) 10Marostegui: [C: 03+2] pc1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/754784 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [06:00:58] (03PS1) 10Marostegui: pc1014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/754805 
(https://phabricator.wikimedia.org/T299046) [06:01:47] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/754805 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [06:02:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1014.eqiad.wmnet with OS bullseye [06:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:07] (03PS1) 10Marostegui: realm.pp: Add ipinfo_ip_changes to private tables [puppet] - 10https://gerrit.wikimedia.org/r/754806 (https://phabricator.wikimedia.org/T297696) [06:13:23] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1014.eqiad.wmnet with OS bullseye [06:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1014.eqiad.wmnet with OS bullseye [06:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:07] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1014.eqiad.wmnet with OS bullseye [06:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [06:44:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [06:46:54] ACKNOWLEDGEMENT - MegaRAID on pc1014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.89. 
Check system logs on 10.64.48.89 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T299376 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:47:02] 10SRE, 10ops-eqiad: Degraded RAID on pc1014 - https://phabricator.wikimedia.org/T299376 (10ops-monitoring-bot) [06:47:40] 10SRE, 10ops-eqiad: Degraded RAID on pc1014 - https://phabricator.wikimedia.org/T299376 (10Marostegui) 05Open→03Declined The host is being reimaged [06:52:25] (03PS1) 10Giuseppe Lavagetto: service::monitor: do not include services with just probes [puppet] - 10https://gerrit.wikimedia.org/r/754808 [06:53:04] (03CR) 10jenkins-bot: [V: 04-1] service::monitor: do not include services with just probes [puppet] - 10https://gerrit.wikimedia.org/r/754808 (owner: 10Giuseppe Lavagetto) [06:54:42] (03PS2) 10Giuseppe Lavagetto: service::monitor: do not include services with just probes [puppet] - 10https://gerrit.wikimedia.org/r/754808 [06:57:45] (03PS3) 10Giuseppe Lavagetto: service::monitor: do not include services with just probes [puppet] - 10https://gerrit.wikimedia.org/r/754808 [07:01:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service::monitor: do not include services with just probes [puppet] - 10https://gerrit.wikimedia.org/r/754808 (owner: 10Giuseppe Lavagetto) [07:09:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1014.eqiad.wmnet with OS bullseye [07:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:27:33] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:29:35] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: 
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:32:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:36:10] (03PS15) 10Elukey: kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [07:36:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1014.eqiad.wmnet with OS bullseye [07:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:58] (03PS16) 10Elukey: kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [07:38:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33287/console" [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [07:42:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:47:50] (03CR) 10Elukey: [V: 03+1] "elukey@alert1001:/usr/lib/nagios/plugins$ ./check_ssl -H kafka-main1001.eqiad.wmnet -p 9093 -w 30 -c 30" [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [07:48:56] (03CR) 10Ladsgroup: [C: 03+1] realm.pp: Add ipinfo_ip_changes to private tables [puppet] - 10https://gerrit.wikimedia.org/r/754806 (https://phabricator.wikimedia.org/T297696) (owner: 10Marostegui) [07:49:18] (03CR) 10Marostegui: [C: 03+2] realm.pp: Add ipinfo_ip_changes to private tables [puppet] - 10https://gerrit.wikimedia.org/r/754806 (https://phabricator.wikimedia.org/T297696) (owner: 10Marostegui) [07:53:09] (03PS17) 10Elukey: 
kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [07:53:51] (03PS18) 10Elukey: kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [07:54:21] (03CR) 10Elukey: "Adding Filippo for the nagios part :)" [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [07:54:33] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33289/console" [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [07:57:12] (03CR) 10Elukey: [V: 03+1] kafka: add check to test the Broker's TLS port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [08:04:25] (03CR) 10Ladsgroup: [C: 03+2] Drop 'inline-media-caption' lint requests [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754144 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [08:05:33] I'll be deploying several backports [08:07:12] (03Merged) 10jenkins-bot: Drop 'inline-media-caption' lint requests [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754144 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [08:09:00] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2022-04-11 07:59:19 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/ [08:10:26] (03CR) 10Ladsgroup: [C: 03+2] "This change is ready for review." 
[extensions/ProofreadPage] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754598 (https://phabricator.wikimedia.org/T292300) (owner: 10Ladsgroup) [08:12:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:50] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/Linter/includes/RecordLintJob.php: Backport: [[gerrit:754144|Drop 'inline-media-caption' lint requests (T297443 T299302)]] (duration: 00m 52s) [08:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:55] T299302: Linter jobs are running slowly - https://phabricator.wikimedia.org/T299302 [08:12:55] T297443: Add a linter category for inline images with captions - https://phabricator.wikimedia.org/T297443 [08:13:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:13:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:46] (03CR) 10Muehlenhoff: [C: 03+2] Enable ganeti 2.16 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/754540 (https://phabricator.wikimedia.org/T296721) (owner: 10Muehlenhoff) [08:20:35] !log cleaning up commons linter errors T298782 [08:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:38] T298782: Linter seems to be not cleaning up after page deletion - https://phabricator.wikimedia.org/T298782 [08:21:10] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 
to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754864 (https://phabricator.wikimedia.org/T299046) [08:23:00] (03PS1) 10Marostegui: mariadb: Promote pc1014 to pc2 master [puppet] - 10https://gerrit.wikimedia.org/r/754865 (https://phabricator.wikimedia.org/T299046) [08:24:30] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754864 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [08:28:18] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754864 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [08:28:24] (03Merged) 10jenkins-bot: Use fillParserOutputInternal instead of getParserOutput. [extensions/ProofreadPage] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754598 (https://phabricator.wikimedia.org/T292300) (owner: 10Ladsgroup) [08:28:34] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote pc1014 to pc2 master [puppet] - 10https://gerrit.wikimedia.org/r/754865 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [08:29:01] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754864 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [08:29:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall lgtm, but we need to work on the regexes." 
[puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [08:30:28] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to master in pc2 T299046 (duration: 00m 51s) [08:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:32] T299046: Upgrade parsercache infra to Bullseye - https://phabricator.wikimedia.org/T299046 [08:32:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1012.eqiad.wmnet with OS bullseye [08:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:50] (03CR) 10Matthias Mullie: [C: 03+1] "Ready to be deployed!" [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753487 (owner: 10Cparle) [08:36:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:36:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [08:37:45] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/ProofreadPage/includes/Page/PageContentHandler.php: Backport: [[gerrit:754598|Use fillParserOutputInternal instead of getParserOutput. 
(T292300)]] (duration: 00m 51s) [08:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:48] T292300: Eliminate unnecessary duplicate parses - https://phabricator.wikimedia.org/T292300 [08:38:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on build2001.codfw.wmnet with reason: reinstallation [08:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2001.codfw.wmnet with reason: reinstallation [08:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:17] (03PS1) 10Ladsgroup: watcheditem: Try getting the cached version in resetNotificationTimestamp [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754599 [08:43:24] (03CR) 10Ladsgroup: [C: 03+2] watcheditem: Try getting the cached version in resetNotificationTimestamp [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754599 (owner: 10Ladsgroup) [08:44:24] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [08:46:31] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) >>! In T292322#7627149, @tstarling wrote: >> 52 seconds in Shellbox\Client::computeHmac over 3 calls, I guess all signatures for the remote shellbox... 
[08:46:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [08:47:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:47:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:28] 10SRE, 10Analytics-Radar: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) This is starting to show up rather frequently, so I am wondering whether it is starting to consume enough time to warrant solving it somehow. Finding the race might prove... [08:49:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:06] (03CR) 10Elukey: [V: 03+1 C: 03+2] "Going to test this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [08:52:17] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754600 [08:52:29] (03PS1) 10Marostegui: Revert "mariadb: Promote pc1014 to pc2 master" [puppet] - 10https://gerrit.wikimedia.org/r/754601 [08:55:06] (03CR) 10Ladsgroup: [C: 03+1] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754600 (owner: 10Marostegui) [08:55:15] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for build2001.codfw.wmnet: Renew puppet certificate - jmm@cumin2002 [08:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for build2001.codfw.wmnet: Renew puppet certificate - jmm@cumin2002 [08:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:58] Amir1: ping me when done backporting, please? 
[08:56:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10Jelto) [08:56:15] sure [08:56:53] (03PS2) 10Volans: redfish: improve support for DRY-RUN mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/749852 [08:57:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1012.eqiad.wmnet with OS bullseye [08:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:55] (03Merged) 10jenkins-bot: watcheditem: Try getting the cached version in resetNotificationTimestamp [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754599 (owner: 10Ladsgroup) [09:04:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:05:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:40] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.17/includes/watcheditem/WatchedItemStore.php: Backport: [[gerrit:754599|watcheditem: Try getting the cached version in resetNotificationTimestamp]] (duration: 00m 51s) [09:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:21] (03CR) 10Volans: "addressed comment" [software/spicerack] - 10https://gerrit.wikimedia.org/r/749852 (owner: 10Volans) [09:09:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:08] (03PS1) 10Ladsgroup: 
page: Use MainObjectStash instead of 'db-replicated' cache [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754602 (https://phabricator.wikimedia.org/T272512) [09:10:28] (03CR) 10Ladsgroup: [C: 03+2] page: Use MainObjectStash instead of 'db-replicated' cache [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754602 (https://phabricator.wikimedia.org/T272512) (owner: 10Ladsgroup) [09:11:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [09:14:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:11] (03CR) 10Ladsgroup: [C: 03+2] Disable "inline-media-caption" category [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754145 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [09:16:10] (03CR) 10Volans: "post merge nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/751228 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [09:18:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:28:08] (03Merged) 10jenkins-bot: page: Use MainObjectStash instead of 'db-replicated' cache [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754602 (https://phabricator.wikimedia.org/T272512) (owner: 10Ladsgroup) [09:28:13] (03Merged) 10jenkins-bot: Disable "inline-media-caption" category [extensions/Linter] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754145 (https://phabricator.wikimedia.org/T297443) (owner: 10Subramanya Sastry) [09:31:16] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/Linter/extension.json: Backport: [[gerrit:754145|Disable "inline-media-caption" category (T297443)]] (duration: 00m 51s) [09:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:20] T297443: Add a linter category for inline images with captions - https://phabricator.wikimedia.org/T297443 [09:32:41] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.17/includes: Backport: [[gerrit:754602|page: Use MainObjectStash instead of 'db-replicated' cache (T272512)]] (duration: 00m 56s) [09:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:45] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - https://phabricator.wikimedia.org/T272512 [09:33:18] PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - Certificate kafka-test1009.eqiad.wmnet valid until 2022-01-29 19:27:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:33:51] taavi: the floor is yours [09:33:53] thanks [09:34:03] (03PS2) 10Majavah: Enable temporary global user groups on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815) [09:34:14] (03CR) 10Majavah: 
[C: 03+2] Enable temporary global user groups on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah) [09:34:22] (03PS1) 10Joal: Add an-test-cord1001 to analytics rsync allow list [puppet] - 10https://gerrit.wikimedia.org/r/754869 [09:34:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:00] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:35:15] (03Merged) 10jenkins-bot: Enable temporary global user groups on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752344 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah) [09:35:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:35:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [09:36:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:26] (03CR) 10Btullis: [C: 03+2] Add an-test-cord1001 to analytics rsync allow list [puppet] - 10https://gerrit.wikimedia.org/r/754869 (owner: 10Joal) [09:38:12] PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is 
CRITICAL: SSL CRITICAL - Certificate kafka-test1007.eqiad.wmnet valid until 2022-01-29 19:16:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:41:03] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:752344|Enable temporary global user groups on production (T153815)]] (duration: 00m 51s) [09:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:06] T153815: Allow global groups to be assigned temporarily (expire) - https://phabricator.wikimedia.org/T153815 [09:41:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:24] taavi: hey, would you at some point be willing to run the maintenance script for fixing the wrong entries in the globalblocks table? [09:44:37] zabe: sure, thanks for reminding me [09:45:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:45:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [09:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:57] the kafka-test tls broker etc.. 
are my fault, new alert [09:47:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:01] !log installing ganeti 2.16.0-1~bpo9+1+wmf1 on ganeti/eqiad servers T296721 [09:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:05] T296721: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 [09:50:12] PROBLEM - Kafka broker TLS certificate validity on kafka-test1010 is CRITICAL: SSL CRITICAL - Certificate kafka-test1010.eqiad.wmnet valid until 2022-01-29 19:09:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:50:23] !log mwscript extensions/GlobalBlocking/maintenance/FixBlockerUsername.php --wiki metawiki "QuiteUnusual" "MarcGarver" # T298707 [09:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:27] T298707: "InvalidArgumentException: Blocker must be a local user" from GlobalBlocking - https://phabricator.wikimedia.org/T298707 [09:50:31] zabe: ^ was that the only case? 
[09:52:29] as far as I know [09:52:46] PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - Certificate kafka-test1008.eqiad.wmnet valid until 2022-01-29 19:19:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:57:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [09:57:45] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754600 (owner: 10Marostegui) [09:57:49] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Promote pc1014 to pc2 master" [puppet] - 10https://gerrit.wikimedia.org/r/754601 (owner: 10Marostegui) [09:57:52] (03PS1) 10Elukey: nagios_common: update check_ssl_kafka warning/critical values [puppet] - 10https://gerrit.wikimedia.org/r/754870 [09:58:32] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754600 (owner: 10Marostegui) [09:59:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 269 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:59:39] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Revert: Promote pc1014 to master in pc2 T299046 (duration: 00m 50s) [09:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:43] T299046: Upgrade parsercache infra to Bullseye - https://phabricator.wikimedia.org/T299046 [10:00:45] !log Move pc1014 to pc3 T299046 [10:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:29] !log running gnt-cluster renew-crypto 
--new-cluster-certificate --new-rapi-certificate --new-spice-certificate for ganeti/eqiad cluster [10:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:03:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:33] (03PS1) 10Marostegui: pc1014: Move to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/754871 (https://phabricator.wikimedia.org/T299046) [10:04:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:53] (03CR) 10Elukey: [C: 03+2] nagios_common: update check_ssl_kafka warning/critical values [puppet] - 10https://gerrit.wikimedia.org/r/754870 (owner: 10Elukey) [10:07:08] PROBLEM - HTTPS Ganeti RAPI eqiad on ganeti1009 is CRITICAL: connect to address ganeti01.svc.eqiad.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [10:07:22] PROBLEM - ganeti-noded running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:07:40] PROBLEM - ganeti-confd running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:07:50] PROBLEM - ganeti-mond running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 
(root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [10:08:58] ^ these are expected, the daemons are stopped while the certs are regenerated/distributed, will recover soon [10:09:40] RECOVERY - ganeti-noded running on ganeti1009 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:11:58] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:40] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:14:02] (03CR) 10Marostegui: [C: 03+2] pc1014: Move to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/754871 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [10:16:00] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:16:30] RECOVERY - HTTPS Ganeti RAPI eqiad on ganeti1009 is OK: HTTP OK: Status line output matched 401 - 309 bytes in 0.014 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [10:17:02] RECOVERY - ganeti-confd running on ganeti1009 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:17:12] RECOVERY - ganeti-mond running on ganeti1009 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [10:20:53] (03PS2) 10Ayounsi: LibreNMS report, only log_info devices with no IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 [10:21:07] (03PS1) 10Volans: sre.mysql.upgrade: various improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/754872 
(https://phabricator.wikimedia.org/T239814) [10:21:59] RECOVERY - Kafka broker TLS certificate validity on kafka-test1010 is OK: SSL OK - Certificate kafka-test1010.eqiad.wmnet valid until 2022-01-29 19:09:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:22:23] RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2022-01-29 19:19:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:22:46] (03PS1) 10Filippo Giunchedi: wmnet: move reads to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/754874 (https://phabricator.wikimedia.org/T299383) [10:22:48] (03PS1) 10Filippo Giunchedi: wmnet: move writes to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/754875 (https://phabricator.wikimedia.org/T299383) [10:23:39] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:29:07] (03PS2) 10Volans: sre.mysql.upgrade: various improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/754872 (https://phabricator.wikimedia.org/T239814) [10:29:23] (03PS1) 10Filippo Giunchedi: Revert "graphite: check graphite2003 metrics" [puppet] - 10https://gerrit.wikimedia.org/r/754876 (https://phabricator.wikimedia.org/T299383) [10:29:25] (03PS1) 10Filippo Giunchedi: Revert "profile: move statsd writes to graphite2003" [puppet] - 10https://gerrit.wikimedia.org/r/754877 (https://phabricator.wikimedia.org/T299383) [10:30:04] (03PS1) 10Marostegui: db1117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/754878 (https://phabricator.wikimedia.org/T299344) [10:30:53] (03CR) 10Marostegui: [C: 03+2] db1117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/754878 
(https://phabricator.wikimedia.org/T299344) (owner: 10Marostegui) [10:31:09] (03CR) 10Volans: "DO NOT MERGE AS IS" [cookbooks] - 10https://gerrit.wikimedia.org/r/754872 (https://phabricator.wikimedia.org/T239814) (owner: 10Volans) [10:31:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1117.eqiad.wmnet with OS bullseye [10:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:54] (03PS1) 10Filippo Giunchedi: Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754879 (https://phabricator.wikimedia.org/T299383) [10:32:34] (03CR) 10Volans: [C: 03+1] "LGTM, let's try this way and we can re-evaluate later" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 (owner: 10Ayounsi) [10:32:53] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:32:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [10:33:01] haproxy alerts are expected [10:33:21] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [10:34:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/749852 (owner: 10Volans) [10:35:27] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report, only log_info devices with no IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 (owner: 10Ayounsi) [10:35:37] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: 
https://wikitech.wikimedia.org/wiki/HAProxy [10:35:48] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report, only log_info devices with no IP (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 (owner: 10Ayounsi) [10:36:10] (03Merged) 10jenkins-bot: LibreNMS report, only log_info devices with no IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 (owner: 10Ayounsi) [10:36:27] ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:36:27] ACKNOWLEDGEMENT - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:37:14] ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:37:39] ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:37:55] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:40:33] (03CR) 10Volans: [C: 03+2] redfish: improve support for DRY-RUN mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/749852 (owner: 10Volans) [10:40:57] RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2022-01-29 19:27:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:40:57] RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2022-01-29 19:16:00 +0000 (expires in 11 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:41:23] 
(03PS1) 10Ayounsi: Atlas exporter: add probes and traceroute mesurements [puppet] - 10https://gerrit.wikimedia.org/r/754880 (https://phabricator.wikimedia.org/T251156) [10:41:29] ACKNOWLEDGEMENT - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:43:46] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [10:44:43] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (10MMandere) We've analyzed `cp3052` and `cp3053` (text and upload nodes respectively) and compared the following resources * Cache hits * Request R... 
[10:44:45] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/pcc-worker1001/33290/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/753000 (owner: 10Ayounsi) [10:45:42] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:46:06] (03CR) 10Ayounsi: [C: 03+2] Add msw2-eqiad to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/753000 (owner: 10Ayounsi) [10:46:24] !log gnt-cluster upgrade --to 2.16 for ganeti/eqiad cluster [10:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:55] (03Merged) 10jenkins-bot: redfish: improve support for DRY-RUN mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/749852 (owner: 10Volans) [10:47:08] (03PS1) 10Filippo Giunchedi: hieradata: use / as miscweb health check [puppet] - 10https://gerrit.wikimedia.org/r/754881 (https://phabricator.wikimedia.org/T291946) [10:49:23] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/pcc-worker1002/33291/netmon1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/754880 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [10:49:24] PROBLEM - ganeti-wconfd running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:50:46] PROBLEM - ganeti-confd running on ganeti1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:50:52] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:52:04] ^ these are expected, the daemons are stopped during the update to 2.16, will recover soon [10:52:14] PROBLEM - ganeti-noded running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded 
https://wikitech.wikimedia.org/wiki/Ganeti [10:53:20] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:53:52] (03PS1) 10Kormat: switchover: Drop tendril support. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754882 (https://phabricator.wikimedia.org/T297605) [10:53:54] RECOVERY - ganeti-confd running on ganeti1012 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:55:26] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:55:43] (03PS1) 10Ayounsi: LibreNMS report only count devices with no IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/754883 [10:55:54] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1117.eqiad.wmnet with OS bullseye [10:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:26] (03CR) 10Kormat: [C: 03+2] switchover: Drop tendril support. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754882 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [10:57:20] PROBLEM - ganeti-noded running on ganeti1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:57:32] PROBLEM - ganeti-confd running on ganeti1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:58:10] PROBLEM - HTTPS Ganeti RAPI eqiad on ganeti1009 is CRITICAL: connect to address ganeti01.svc.eqiad.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [10:58:20] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:58:49] (03Merged) 10jenkins-bot: switchover: Drop tendril support. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754882 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [10:59:01] (03PS1) 10Marostegui: Revert "db1117: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/754604 [10:59:58] RECOVERY - ganeti-noded running on ganeti1012 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:00:08] (03PS1) 10Alexandros Kosiaris: ttyS0-115200: Add a comment about this being VM specific [puppet] - 10https://gerrit.wikimedia.org/r/754884 [11:00:10] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:00:12] RECOVERY - ganeti-confd running on ganeti1012 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:00:35] (03PS1) 10Elukey: role::pki::multirootca: add dedicated profile for ml-serve k8s [puppet] - 10https://gerrit.wikimedia.org/r/754885 
(https://phabricator.wikimedia.org/T298976) [11:00:38] PROBLEM - ganeti-confd running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:01:40] RECOVERY - ganeti-wconfd running on ganeti1009 is OK: PROCS OK: 1 process with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:02:06] (03CR) 10Marostegui: [C: 03+2] Revert "db1117: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/754604 (owner: 10Marostegui) [11:02:10] RECOVERY - ganeti-confd running on ganeti1009 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:02:18] RECOVERY - ganeti-noded running on ganeti1009 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:02:32] RECOVERY - HTTPS Ganeti RAPI eqiad on ganeti1009 is OK: HTTP OK: Status line output matched 401 - 309 bytes in 0.019 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:02:36] (03PS1) 10Elukey: profile::pki::multirootca: add fake profile credentials for ml-serve [labs/private] - 10https://gerrit.wikimedia.org/r/754887 [11:02:47] (03PS1) 10Volans: requests: add support for conn/read timeouts [software/pywmflib] - 10https://gerrit.wikimedia.org/r/754888 [11:02:52] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::pki::multirootca: add fake profile credentials for ml-serve [labs/private] - 10https://gerrit.wikimedia.org/r/754887 (owner: 10Elukey) [11:04:45] (03CR) 10Volans: LibreNMS report only count devices with no IP (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/754883 (owner: 10Ayounsi) [11:06:10] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:06:28] !log start rolling upgrade to varnish 6.0.9 T298758 [11:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:32] T298758: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 [11:06:54] !log running gnt-cluster renew-crypto --new-node-certificates for ganeti/eqiad cluster following 2.16 update [11:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:02] 10SRE, 10Analytics-Radar: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) There is indeed a race condition between `networking.service` and `ifup@ens5.service`. Checked on a couple of VMs that did not exhibit this problem as well as some that di... [11:07:09] (03PS3) 10Jcrespo: mediabackups: Backup s7 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754025 (https://phabricator.wikimedia.org/T262668) [11:08:04] PROBLEM - ganeti-noded running on ganeti1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:08:04] PROBLEM - ganeti-mond running on ganeti1022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:09:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/754880 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [11:09:48] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:09:56] RECOVERY - ganeti-noded running on ganeti1010 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded 
https://wikitech.wikimedia.org/wiki/Ganeti [11:10:16] PROBLEM - ganeti-confd running on ganeti1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:10:58] PROBLEM - ganeti-confd running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:11:12] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:26] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:28] PROBLEM - HTTPS Ganeti RAPI eqiad on ganeti1009 is CRITICAL: connect to address ganeti01.svc.eqiad.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:11:56] RECOVERY - ganeti-mond running on ganeti1022 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:11:57] (03CR) 10Elukey: "I added https://gerrit.wikimedia.org/r/c/labs/private/+/754887 but I am currently getting an error from pcc:" [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [11:12:16] RECOVERY - ganeti-confd running on ganeti1011 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:13:37] (03PS2) 10Ayounsi: LibreNMS report only count devices with no IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/754883 [11:14:01] (03CR) 10Ayounsi: "Thanks!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/754883 (owner: 10Ayounsi) [11:14:48] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:14:51] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report only count devices with no IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/754883 (owner: 10Ayounsi) [11:15:18] RECOVERY - ganeti-confd running on ganeti1009 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:15:46] RECOVERY - HTTPS Ganeti RAPI eqiad on ganeti1009 is OK: HTTP OK: Status line output matched 401 - 309 bytes in 0.021 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:16:48] (03PS1) 10Elukey: helmfile.d: deploy cert-manager for ml-serve nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/754890 (https://phabricator.wikimedia.org/T298976) [11:17:38] 10SRE, 10Analytics-Radar: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#7627740, @akosiaris wrote: > * Get rid of ifupdown and /etc/network/interfaces and get a proper and modern network interface manager. See T234207. T... 
[11:18:41] (03PS2) 10Jbond: P:installserver::proxy: switch access logs to syslog [puppet] - 10https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) [11:19:18] PROBLEM - Disk space on deploy1002 is CRITICAL: DISK CRITICAL - /run/docker/netns/663e8ee211ef is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops [11:19:42] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [11:20:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [11:22:12] (03PS1) 10Ayounsi: Add grafana-worldmap-panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/754892 (https://phabricator.wikimedia.org/T251184) [11:22:56] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Sérgio Lopes - https://phabricator.wikimedia.org/T299353 (10dr0ptp4kt) Approved. [11:24:53] (03PS2) 10Ayounsi: Add grafana-worldmap-panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/754892 (https://phabricator.wikimedia.org/T251184) [11:27:34] 10SRE, 10Analytics-Radar: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) >>! In T273026#7627758, @MoritzMuehlenhoff wrote: >>>! In T273026#7627740, @akosiaris wrote: >> * Get rid of ifupdown and /etc/network/interfaces and get a proper and mode... 
[11:28:15] !log mwscript findBadBlobs.php --wiki=dewiki --revisions 5730218 --mark "T299387" [11:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:19] T299387: Bad revision in German Wikipedia - https://phabricator.wikimedia.org/T299387 [11:29:28] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:35:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, in terms of cardinality / metric load on Prometheus let's see what happens! Might need to revert but we'll worry about that later" [puppet] - 10https://gerrit.wikimedia.org/r/754880 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [11:35:55] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:38:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:38:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:39:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:39:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [11:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [11:39:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T285149)', diff saved to https://phabricator.wikimedia.org/P18766 and previous config saved to /var/cache/conftool/dbconfig/20220118-113916-marostegui.json [11:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:20] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [11:40:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T285149)', diff saved to https://phabricator.wikimedia.org/P18767 and previous config saved to /var/cache/conftool/dbconfig/20220118-114024-marostegui.json [11:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:57] (03PS1) 10Muehlenhoff: scap: No longer install dependencies via Puppet [puppet] - 
10https://gerrit.wikimedia.org/r/754894 (https://phabricator.wikimedia.org/T298463) [11:44:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:46:04] !log Rolled back Quibble 1.3.0 jobs due to php configuration files with at least releng/quibble-buster73:1.3.0 # T299389 [11:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:07] T299389: Wikibase CI broken due to missing PHP extensions: dom, intl, mbstring, xml, xmlreader, xmlwriter - https://phabricator.wikimedia.org/T299389 [11:48:30] (03PS2) 10Muehlenhoff: scap: No longer install dependencies via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/754894 (https://phabricator.wikimedia.org/T298463) [11:50:41] (03PS1) 10Kosta Harlan: Monitoring: Add '.Save' to distinguish from '.Click' events [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754605 (https://phabricator.wikimedia.org/T286366) [11:50:44] (03PS1) 10Giuseppe Lavagetto: mediawiki-httpd: add and configure mod_remoteip [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/754897 (https://phabricator.wikimedia.org/T297613) [11:52:55] (03CR) 10Jbond: [C: 04-1] "you will first need to create the CA on the root server and upload the public certificate, see:" [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [11:53:08] (03PS1) 10Kosta Harlan: Monitoring: Add '.Save' to distinguish from '.Click' events [extensions/GrowthExperiments] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754906 (https://phabricator.wikimedia.org/T286366) [11:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P18768 and previous config saved to 
/var/cache/conftool/dbconfig/20220118-115529-marostegui.json [11:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1200). [12:00:05] kostajh, subbu[m], and nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:11] o/ [12:00:19] hi [12:00:26] hi [12:00:34] eight changes, tsk tsk tsk ;) [12:00:38] * urbanecm waves but can't deploy [12:00:42] We may need to wait for the job updates that hashar is doing [12:00:57] I'm also present-ish but would prefer not deploying [12:01:04] I can deploy [12:01:21] * kostajh offers Lucas_WMDE a cookie [12:01:26] hi [12:02:08] (03PS2) 10Jbond: nfs-mounts: Used to store facts between all nodes [puppet] - 10https://gerrit.wikimedia.org/r/754509 (https://phabricator.wikimedia.org/T299390) [12:02:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Post-edit dialog: Reload page upon dialog closing for structured tasks [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754129 (https://phabricator.wikimedia.org/T299188) (owner: 10Kosta Harlan) [12:02:22] let’s start with kostajh [12:04:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] fawiki: Add flow-delete right to eliminators (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223) (owner: 104nn1l2) [12:06:30] (03PS1) 10Vgutierrez: cache::envoy: Set upstream idle timeout [puppet] - 10https://gerrit.wikimedia.org/r/754901 (https://phabricator.wikimedia.org/T271421) [12:06:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist (031 
comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754612 (https://phabricator.wikimedia.org/T299247) (owner: 104nn1l2) [12:06:51] deploying the commonswiki change while waiting for GrowthExperiments CI [12:06:58] (03PS1) 10Giuseppe Lavagetto: Add bullseye build [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/754902 [12:07:00] (03PS1) 10Giuseppe Lavagetto: Add build2001 as a target [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/754903 [12:07:04] nn1l2: is that change testable without actually uploading a file to Commons? [12:07:18] let me upload a file [12:07:39] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33294/console" [puppet] - 10https://gerrit.wikimedia.org/r/754901 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:07:45] (03Merged) 10jenkins-bot: commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754612 (https://phabricator.wikimedia.org/T299247) (owner: 104nn1l2) [12:08:15] nn1l2: alright, the change should be on mwdebug1001 now [12:08:21] please let me know if it works [12:10:09] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Set upstream idle timeout [puppet] - 10https://gerrit.wikimedia.org/r/754901 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:10:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P18769 and previous config saved to /var/cache/conftool/dbconfig/20220118-121034-marostegui.json [12:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] There was a problem during the 
HTTP request: 432 [12:12:16] test failed [12:12:30] I couldn't upload it [12:13:03] 10SRE, 10SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10Jelto) p:05Triage→03Medium a:03NRodriguez [12:13:30] (03CR) 10Jbond: [C: 03+2] nfs-mounts: Used to store facts between all nodes [puppet] - 10https://gerrit.wikimedia.org/r/754509 (https://phabricator.wikimedia.org/T299390) (owner: 10Jbond) [12:13:45] o_O 432 doesn’t seem to be a known HTTP error code [12:13:59] * Lucas_WMDE peeks at logstash [12:14:37] nn1l2: did you use the WikimediaDebug extension? [12:14:46] yes [12:15:07] weird, I only see one log event in the mwdebug logstash dashboard [12:15:12] usually there are more events even for regular pageviews [12:15:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:15:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:47] this is a new error [12:15:49] hm, the one event is for commons Special:Upload though, so that is probably your request [12:15:57] I have not seen it before [12:16:10] can you try again, maybe? [12:16:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:32] still the same weird error: There was a problem during the HTTP request: 432 [12:18:08] very weird [12:18:57] I think I’ll sync this change anyways [12:19:07] it seems unlikely that this error is due to the addition [12:19:08] okay [12:19:15] I feel like it’s more likely that mwdebug1001 has errors in general [12:19:21] hm, maybe you could try mwdebug1002? 
[12:19:27] yes [12:20:24] kostajh: your other backport (add .save to distinguish…) failed gate-and-submit on master, can you retry that? [12:20:37] I’d prefer not to merge the backport before it’s successfully gone into master [12:20:54] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/754893 [12:23:30] Lucas_WMDE: yeah I +2'ed already [12:23:40] ok thanks [12:25:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T285149)', diff saved to https://phabricator.wikimedia.org/P18770 and previous config saved to /var/cache/conftool/dbconfig/20220118-122538-marostegui.json [12:25:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:25:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:43] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [12:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T285149)', diff saved to https://phabricator.wikimedia.org/P18771 and previous config saved to /var/cache/conftool/dbconfig/20220118-122546-marostegui.json [12:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:55] nn1l2: any news? [12:25:56] 10SRE, 10SRE-Access-Requests: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10Jelto) p:05Triage→03Medium @MNovotny_WMF this access request needs some more information before we can proceed. The mentioned `data access` is a bit broad. Please clarify what da... 
[12:26:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Prod-Kubernetes, and 3 others: decommission kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10Aklapper) [12:26:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T285149)', diff saved to https://phabricator.wikimedia.org/P18772 and previous config saved to /var/cache/conftool/dbconfig/20220118-122654-marostegui.json [12:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:13] !log imported docker-report bullseye rebuild to apt.wikimedia.org T298463 [12:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:16] T298463: Setup a new build host based on bullseye - https://phabricator.wikimedia.org/T298463 [12:27:51] It says "Copy uploads are not available from this domain." [12:28:00] test failed again [12:28:04] ok, that makes sense at least [12:28:07] I’ll just sync it [12:28:11] and then you can try it without debug [12:29:30] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754612|commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist (T299247)]] (duration: 00m 51s) [12:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:34] T299247: Add peerj.com to Commons wgCopyUploadsDomains whitelist - https://phabricator.wikimedia.org/T299247 [12:29:42] (03Merged) 10jenkins-bot: Post-edit dialog: Reload page upon dialog closing for structured tasks [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754129 (https://phabricator.wikimedia.org/T299188) (owner: 10Kosta Harlan) [12:30:01] alright let’s do the first GrowthExperiments backport [12:30:09] nn1l2: please let me know if it works now in the meantime [12:30:32] it live now? [12:30:43] it's live now? 
[12:31:00] it should be on all wikis, yes [12:31:16] kostajh: the PostEdit JS change should be on mwdebug1001 now, can you test it? [12:31:23] Yes looking [12:32:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:33:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:17] It still fails [12:33:29] with the same weird error: There was a problem during the HTTP request: 432 [12:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [12:34:04] weird [12:34:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:34:27] I guess that could mean that peerj.com responds with HTTP 432 to the request MediaWiki makes? 
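The guess that 432 is not a standard status code can be sanity-checked offline. A minimal sketch, assuming Python's standard http.HTTPStatus enum is an acceptable stand-in for the IANA status code registry; describe_status is a hypothetical helper name and is not part of MediaWiki or any tool mentioned in this log:

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the reason phrase Python knows for a status code, if any."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # The enum has no member for this code, which suggests (but does not
        # prove) that the code is unassigned in the IANA registry.
        return "not in http.HTTPStatus"


# 401 is the status the Ganeti RAPI check matched earlier in the log,
# 431 is the closest registered neighbour of the code seen here.
for code in (401, 431, 432):
    print(code, describe_status(code))
```

If 432 comes back as unknown, that is consistent with the remote server using a non-standard code; it says nothing about why the request was rejected.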
[12:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:36] Lucas_WMDE: looks good, we can sync it [12:34:42] ok thanks [12:35:32] syncing [12:36:11] nn1l2: looks like someone on StackOverflow got the same error https://stackoverflow.com/questions/70718078/download-pdf-from-peerj [12:36:18] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/GrowthExperiments/modules/ext.growthExperiments.PostEdit/index.js: Backport: [[gerrit:754129|Post-edit dialog: Reload page upon dialog closing for structured tasks (T299188)]] (duration: 00m 51s) [12:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:22] T299188: [testwiki-wmf.17] Add image - post-edit dialog option "Close and edit this article again" displays "Suggestions are no longer available on this article" - https://phabricator.wikimedia.org/T299188 [12:36:39] sounds like PeerJ are just blocking requests without the right Referer/User-Agent/who-knows-which-one request header, with an unassigned HTTP error code? [12:37:13] which is a shame but ultimately their problem I’d say [12:37:24] and we should probably remove the domain again until they go and make their website friendlier [12:37:30] so we should revert? [12:37:44] I think so yeah [12:37:56] unless this request came from PeerJ people or we have some contacts there? 
[12:38:09] so that there would be a reasonable chance of getting this resolved on their end [12:38:22] no, it was just a wikipedian [12:38:48] I decline the phab request [12:38:59] thanks, and please paste the error there [12:39:09] revert doesn’t have to be in this window, let’s see how it plays out [12:39:26] that's good [12:39:35] let us wait at least 24 hours [12:40:00] (03PS1) 10Jbond: O:puppet_compiler: mount yaml dir [puppet] - 10https://gerrit.wikimedia.org/r/754904 [12:40:12] (03PS2) 10Lucas Werkmeister (WMDE): azwiki: Add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754613 (https://phabricator.wikimedia.org/T299332) (owner: 104nn1l2) [12:40:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33295/console" [puppet] - 10https://gerrit.wikimedia.org/r/754904 (owner: 10Jbond) [12:40:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppet_compiler: mount yaml dir [puppet] - 10https://gerrit.wikimedia.org/r/754904 (owner: 10Jbond) [12:41:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] azwiki: Add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754613 (https://phabricator.wikimedia.org/T299332) (owner: 104nn1l2) [12:41:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P18773 and previous config saved to /var/cache/conftool/dbconfig/20220118-124159-marostegui.json [12:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:27] (03Merged) 10jenkins-bot: azwiki: Add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754613 (https://phabricator.wikimedia.org/T299332) (owner: 104nn1l2) [12:43:48] nn1l2: meanwhile, the azwiki draft namespace should be on mwdebug1001, can you test it? 
[12:43:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [12:43:55] (I’m checking it as well) [12:44:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:01] LGTM [12:45:15] alright [12:45:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:45:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:45:51] syncing [12:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Monitoring: Add '.Save' to distinguish from '.Click' events [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754605 (https://phabricator.wikimedia.org/T286366) (owner: 10Kosta Harlan) [12:46:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Monitoring: Add '.Save' to distinguish from '.Click' events [extensions/GrowthExperiments] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754906 (https://phabricator.wikimedia.org/T286366) (owner: 10Kosta Harlan) [12:46:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754613|azwiki: Add draft namespace (T299332)]] (duration: 00m 51s) [12:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:40] T299332: Add draft namespace on Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T299332 [12:47:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:47:03] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:12] nn1l2: if you want to update https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/753969 (or convince me the comment shouldn’t be split :P) we can deploy that as well [12:47:20] while waiting for the gate-and-submit in GrowthExperiments [12:47:52] subbu[m]: are you around btw? [12:48:11] give me a min and I'll upload a patch [12:48:20] ok [12:48:24] hm, subbu’s changes were already merged [12:48:27] * Lucas_WMDE checks SAL [12:48:52] ok, already deployed this morning [12:49:00] nothing to do there I guess 🤷 [12:50:16] (I added a comment to the PeerJ task btw) [12:50:49] (03PS3) 104nn1l2: fawiki: Add flow-delete right to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223) [12:51:08] uploaded [12:51:17] thx, lgtm [12:51:22] waiting for CI there [12:51:39] (03PS1) 10Jbond: O:puppet_compiler: mount yaml dir [puppet] - 10https://gerrit.wikimedia.org/r/754927 [12:52:00] !log installing ghostscript security updates for stretch [12:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:03] (03CR) 104nn1l2: fawiki: Add flow-delete right to eliminators (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223) (owner: 104nn1l2) [12:52:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33296/console" [puppet] - 10https://gerrit.wikimedia.org/r/754927 (owner: 10Jbond) [12:52:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppet_compiler: mount yaml dir [puppet] - 10https://gerrit.wikimedia.org/r/754927 (owner: 10Jbond) [12:53:15] !log mwdebug-deploy@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:53:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:19] (03PS4) 10Lucas Werkmeister (WMDE): fawiki: Add flow-delete right to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223) (owner: 104nn1l2) [12:54:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] fawiki: Add flow-delete right to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223) (owner: 104nn1l2) [12:54:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:01] Lucas_WMDE: it finished merging into master [12:56:11] ack, thanks [12:56:22] (I already +2ed the backports on the assumption it wouldn’t fail again) [12:56:26] (03Merged) 10jenkins-bot: fawiki: Add flow-delete right to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223) (owner: 104nn1l2) [12:56:28] Zuul says 9 more minutes [12:56:51] nn1l2: fawiki change is on mwdebug1001, can you test it? 
[12:57:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P18774 and previous config saved to /var/cache/conftool/dbconfig/20220118-125703-marostegui.json [12:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:45] Good to go [12:57:47] https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%D8%A7%D8%AE%D8%AA%DB%8C%D8%A7%D8%B1%D8%A7%D8%AA_%DA%AF%D8%B1%D9%88%D9%87%E2%80%8C%D9%87%D8%A7%DB%8C_%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1%DB%8C&uselang=en looks correct to me [12:57:49] ok [12:58:18] syncing [12:58:42] jouncebot: next [12:58:42] In 3 hour(s) and 1 minute(s): CI server restart (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1600) [12:58:57] ok, we’ll overrun the window a bit for the last GrowthExperiments backports but should be okay [12:59:06] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753969|fawiki: Add flow-delete right to eliminators (T299223)]] (duration: 00m 51s) [12:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:10] T299223: Add flow-delete right to eliminators on fawiki - https://phabricator.wikimedia.org/T299223 [12:59:32] (03PS1) 10Ayounsi: Update requirements [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/754929 [12:59:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:25] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/754929 (owner: 10Ayounsi) [13:01:41] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Update requirements [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/754929 (owner: 10Ayounsi) [13:02:10] only waiting for those gate-and-submit builds now [13:02:22] (is the correct plural 
gate-and-submits or gates-and-submit 🤔) [13:02:54] !log ayounsi@deploy1002 Started deploy [homer/deploy@0f02386]: update requirements [13:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:09] keine ahnung [13:03:26] on second thought gates-and-submit sounds like a microsoft joke from the early 2000s [13:04:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:04:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:22] !log ayounsi@deploy1002 Finished deploy [homer/deploy@0f02386]: update requirements (duration: 01m 27s) [13:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:17] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: update requirements - ayounsi@cumin1001 [13:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: update requirements - ayounsi@cumin1001 [13:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:50] (03PS3) 10Giuseppe Lavagetto: changeprop/api-gateway: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730559 (https://phabricator.wikimedia.org/T291530) [13:12:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T285149)', diff saved to 
https://phabricator.wikimedia.org/P18775 and previous config saved to /var/cache/conftool/dbconfig/20220118-131208-marostegui.json [13:12:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:12:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:13] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T285149)', diff saved to https://phabricator.wikimedia.org/P18776 and previous config saved to /var/cache/conftool/dbconfig/20220118-131215-marostegui.json [13:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:05] !log installing python-babel security updates on buster [13:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:47] kostajh: looks like the backports are about to merge [13:17:32] (03Merged) 10jenkins-bot: Monitoring: Add '.Save' to distinguish from '.Click' events [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754605 (https://phabricator.wikimedia.org/T286366) (owner: 10Kosta Harlan) [13:17:35] (03Merged) 10jenkins-bot: Monitoring: Add '.Save' to distinguish from '.Click' events [extensions/GrowthExperiments] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754906 (https://phabricator.wikimedia.org/T286366) (owner: 10Kosta Harlan) [13:17:37] I’m guessing the wmf.18 one won’t really be testable on its own [13:17:40] good prediction [13:17:42] but wmf.17 should be okay hopefully [13:17:45] yeah 
nothing to test for wmf.18 [13:18:02] I can try a test for wmf.17, sure [13:18:41] ok, wmf.17 should be on mwdebug1001 now [13:20:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:20] Lucas_WMDE: thanks, looking [13:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T285149)', diff saved to https://phabricator.wikimedia.org/P18777 and previous config saved to /var/cache/conftool/dbconfig/20220118-132026-marostegui.json [13:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:33] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:21:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:21:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:22:42] Lucas_WMDE: it works [13:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:49] ack [13:24:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/GrowthExperiments/includes/HomepageHooks.php: Backport: [[gerrit:754605|Monitoring: Add '.Save' to distinguish from '.Click' events (T286366)]] (duration: 00m 54s) [13:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:43] T286366: Implement product key performance indicator monitoring for Growth features in Grafana - https://phabricator.wikimedia.org/T286366 [13:25:08] hrm, php-1.38.0-wmf.18 does not exist 
yet [13:25:17] on deploy1002 [13:25:40] I think that means it’s okay to leave it alone and it’ll make it into the train automatically [13:26:02] but pinging the conductors just in case [13:26:23] jeena, twentyafterfour: I merged the wmf.18 backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/754906 before wmf.18 existed on deploy1002, I hope that’s okay [13:26:59] !log UTC morning backport window done [13:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:28:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:34] thank you Lucas_WMDE [13:29:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:04] (03CR) 10Filippo Giunchedi: [C: 03+1] Add grafana-worldmap-panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/754892 (https://phabricator.wikimedia.org/T251184) (owner: 10Ayounsi) [13:32:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:32:55] np :) [13:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [13:34:45] 
^ eek [13:35:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P18778 and previous config saved to /var/cache/conftool/dbconfig/20220118-133531-marostegui.json [13:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:35] looks like a spike every hour or so [13:37:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:40:24] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add grafana-worldmap-panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/754892 (https://phabricator.wikimedia.org/T251184) (owner: 10Ayounsi) [13:43:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [13:45:43] Lucas_WMDE: I think that's fine [13:46:16] !log add grafana-plugins 0.3 (with worldmap plugin) to reprepo [13:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:04] (03PS1) 10Kormat: Prepare for 0.8 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754935 [13:47:32] (03PS2) 10Kormat: Prepare for 0.8 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754935 (https://phabricator.wikimedia.org/T297605) [13:49:19] (03CR) 10Elukey: role::pki::multirootca: add dedicated profile for ml-serve k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [13:50:25] (03CR) 10Kormat: [C: 03+2] Prepare for 0.8 release. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754935 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [13:50:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P18779 and previous config saved to /var/cache/conftool/dbconfig/20220118-135036-marostegui.json [13:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:00] (03Merged) 10jenkins-bot: Prepare for 0.8 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754935 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [13:53:37] (03PS1) 10Vgutierrez: cache::envoy: Decrease upstream idle_timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/754938 (https://phabricator.wikimedia.org/T271421) [13:55:05] (03CR) 10Vgutierrez: [C: 03+2] cache::envoy: Decrease upstream idle_timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/754938 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:55:30] !log update grafana-plugins on grafana hosts - T251184 [13:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:33] T251184: Add Grafana worldmap panel - https://phabricator.wikimedia.org/T251184 [13:58:40] (03PS2) 10JMeybohm: Update codfw kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) [14:05:12] (03CR) 10Ottomata: P:installserver::proxy: Add domain whitelist to proxy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [14:05:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T285149)', diff saved to https://phabricator.wikimedia.org/P18780 and previous config saved to /var/cache/conftool/dbconfig/20220118-140540-marostegui.json [14:05:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:51] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [14:06:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (10Ottomata) Approved [14:06:50] (03CR) 10Jbond: [V: 03+1 C: 04-1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33298/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [14:07:56] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [14:10:06] !log installing vim security updates on stretch [14:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:13] (03PS1) 10JMeybohm: Add keys needed for k8s node profile to master nodes [labs/private] - 10https://gerrit.wikimedia.org/r/754943 [14:14:42] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add keys needed for k8s node profile to master nodes [labs/private] - 10https://gerrit.wikimedia.org/r/754943 (owner: 10JMeybohm) [14:15:01] (03PS3) 10JMeybohm: Update codfw kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) [14:16:45] (03PS3) 10Arturo Borrero Gonzalez: wmcs: factorize common arguments [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754473 [14:16:47] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: introduce cookbook to repool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754555 (https://phabricator.wikimedia.org/T298948) [14:16:49] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: introduce cookbook to verify basic grid health [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) 
[14:18:19] 10SRE-swift-storage, 10Data-Engineering, 10Data-Engineering-Kanban: Deploy research_poc Swift credidentials to Hadoop - https://phabricator.wikimedia.org/T296945 (10Ottomata) Hm, perhaps, although I'm not sure where. This is sort of a one off. We'd love to have more first class support for exporting to swi... [14:18:28] 10SRE-swift-storage, 10Data-Engineering, 10Data-Engineering-Kanban: Deploy research_poc Swift credidentials to Hadoop - https://phabricator.wikimedia.org/T296945 (10Ottomata) 05Open→03Resolved [14:18:32] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10Ottomata) [14:19:04] (03PS1) 10JMeybohm: Add kubestagemaster2001 to k8s_staging iBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) [14:19:51] (03CR) 10jerkins-bot: [V: 04-1] Add kubestagemaster2001 to k8s_staging iBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:20:00] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: introduce cookbook to verify basic grid health [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [14:21:16] (03PS2) 10JMeybohm: Add kubestagemaster2001 to k8s_staging iBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) [14:25:54] 10SRE, 10SRE-Access-Requests: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10cmooney) @Jelto thanks for picking this up. I discussed briefly with Margeigh on Slack and she confirmed she needs access to dashboards with private data in Superset. So I believe w...
[14:27:21] (03PS1) 10Jbond: hieradata pcc: add deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/754948 [14:27:54] (03CR) 10Jbond: [C: 03+2] hieradata pcc: add deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/754948 (owner: 10Jbond) [14:28:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor pedantic nit, otherwise +1" [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:28:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10Ottomata) Approved! [14:28:48] !log installing xorg-server security updates on stretch [14:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:14] (03CR) 10Ottomata: ":)" [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [14:29:25] (03PS4) 10JMeybohm: Update codfw kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) [14:30:41] (03PS1) 10Jbond: O:uppetmaster::standalone: Add upload_facts parameter [puppet] - 10https://gerrit.wikimedia.org/r/754949 [14:31:42] !log installing rsync security updates on stretch [14:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:06] !log uploaded wmfmariadbpy 0.8 to apt.wm.o [14:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:18] !log Deploying wmfmariadbpy 0.8 T299406 [14:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:21] T299406: Deploy wmfmariadbpy 0.8 - https://phabricator.wikimedia.org/T299406 [14:33:26] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33301/console" [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:33:33] 10SRE, 
10Data-Engineering: Allow kafka brokers to reload the TLS keystore - https://phabricator.wikimedia.org/T299409 (10elukey) [14:33:35] (03CR) 10Jbond: [C: 03+2] O:uppetmaster::standalone: Add upload_facts parameter [puppet] - 10https://gerrit.wikimedia.org/r/754949 (owner: 10Jbond) [14:34:53] (03PS3) 10JMeybohm: Add kubestagemaster2001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) [14:35:58] (03CR) 10JMeybohm: Add kubestagemaster2001 to k8s_staging eBGP config (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:36:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:41:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (10Jelto) [14:46:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:49:52] (03PS4) 10Eigyan: [wmf-config] Deploy the cawiki test safety survey to production. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) [14:55:12] (03PS1) 10Jelto: admin: Shell account and analytics-privatedata-users for nray [puppet] - 10https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) [14:55:18] (03CR) 10JMeybohm: role::pki::multirootca: add dedicated profile for ml-serve k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [14:56:50] (03CR) 10Elukey: role::pki::multirootca: add dedicated profile for ml-serve k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [14:57:09] jouncebot: now [14:57:09] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [14:57:12] jouncebot: next [14:57:12] In 1 hour(s) and 2 minute(s): CI server restart (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1600) [14:58:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10Jelto) [14:58:27] (03CR) 10JMeybohm: [C: 03+1] helmfile.d: deploy cert-manager for ml-serve nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/754890 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:02:19] (03PS1) 10Jelto: admin: Shell account and analytics-privatedata-users for mfossati [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) [15:04:12] (03PS1) 10Kormat: dbutil: read_section_ports_list() bug when path not supplied [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754956 [15:06:20] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/754957 [15:06:46] (03PS1) 10Jbond: C:cfssl:signer: update default expiry [puppet] - 10https://gerrit.wikimedia.org/r/754958 [15:07:28] (03CR) 10Jbond: [V: 03+1 C: 03+1] 
role::pki::multirootca: add dedicated profile for ml-serve k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:07:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33302/console" [puppet] - 10https://gerrit.wikimedia.org/r/754958 (owner: 10Jbond) [15:09:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:cfssl:signer: update default expiry [puppet] - 10https://gerrit.wikimedia.org/r/754958 (owner: 10Jbond) [15:09:54] (03CR) 10Elukey: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:12:28] (03PS1) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) [15:14:50] (03CR) 10AOkoth: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33303/console" [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) (owner: 10AOkoth) [15:15:06] (03CR) 10Klausman: [C: 03+1] dbutil: read_section_ports_list() bug when path not supplied [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754956 (owner: 10Kormat) [15:16:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubestagemaster2001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/754945 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:18:53] (03CR) 10Kormat: [C: 03+2] dbutil: read_section_ports_list() bug when path not supplied [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754956 (owner: 10Kormat) [15:20:52] (03CR) 10JMeybohm: kuberenetes: disable mwautopull timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) (owner: 10AOkoth) [15:21:36] (03Merged) 10jenkins-bot: dbutil: read_section_ports_list() bug when path not 
supplied [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/754956 (owner: 10Kormat) [15:25:10] (03CR) 10Elukey: [C: 03+1] "LGTM, I didn't spot anything strange. Couple of notes from IRC:" [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:27:50] (03CR) 10Jbond: "looks good but see nit" [puppet] - 10https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) (owner: 10Jelto) [15:29:36] (03CR) 10Elukey: [C: 03+2] role::pki::multirootca: add dedicated profile for ml-serve k8s [puppet] - 10https://gerrit.wikimedia.org/r/754885 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:32:29] (03CR) 10Jbond: [C: 04-1] "-1 see comment i think this will need `krb: present`" [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) (owner: 10Jelto) [15:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:35:04] (03CR) 10Elukey: [C: 03+2] helmfile.d: deploy cert-manager for ml-serve nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/754890 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:35:11] !log regenerate kartotherian certs via cergen - T297604 [15:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:15] T297604: cergen should include the cert's name in SAN too - https://phabricator.wikimedia.org/T297604 [15:35:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cfssl site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:36:31] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
cfssl-multirootca.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:39] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-multirootca.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:42] (03PS5) 10JMeybohm: Update codfw kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) [15:39:27] (03CR) 10JMeybohm: Update codfw kubernetes master to a full node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754556 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:40:34] (03CR) 10Eevans: [C: 03+1] partman: use reuse profiles on all restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/753986 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [15:43:18] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/754894 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [15:44:21] (03CR) 10Muehlenhoff: [C: 03+2] scap: No longer install dependencies via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/754894 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [15:44:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:45:13] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:16] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 02s) [15:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:27] (03PS2) 10Jelto: admin: Shell account and 
analytics-privatedata-users for mfossati [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) [15:45:49] (03PS1) 10Filippo Giunchedi: ssl: update kartotherian cert [puppet] - 10https://gerrit.wikimedia.org/r/754968 (https://phabricator.wikimedia.org/T297604) [15:46:55] (03CR) 10Jelto: admin: Shell account and analytics-privatedata-users for mfossati (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) (owner: 10Jelto) [15:47:21] (03CR) 10Filippo Giunchedi: [C: 03+2] ssl: update kartotherian cert [puppet] - 10https://gerrit.wikimedia.org/r/754968 (https://phabricator.wikimedia.org/T297604) (owner: 10Filippo Giunchedi) [15:47:43] !log resizing the wikitech-static host for T298052 [15:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/754902 (owner: 10Giuseppe Lavagetto) [15:48:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/754903 (owner: 10Giuseppe Lavagetto) [15:50:01] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:10] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [15:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:20] !log installing libssh2 security updates on stretch [15:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:49] !log update kartotherian certs on maps hosts and roll-reload nginx - T297604 [15:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:52] T297604: cergen should include the cert's name in SAN too - 
https://phabricator.wikimedia.org/T297604 [15:55:16] (03PS1) 10MMandere: cumin: Add cache::upload_envoy to cp aliases [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) [15:56:54] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) @hashar let me know when this is offline so i can take over [15:57:18] (03PS2) 10Jbond: P:rsyslog: add squid to the list of programs sent to central log [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) [15:57:25] (03CR) 10jerkins-bot: [V: 04-1] cumin: Add cache::upload_envoy to cp aliases [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) (owner: 10MMandere) [15:57:39] (03CR) 10Jbond: "no need for mtail as there is already squid exporter, as such i think this is it?" [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [15:59:20] (03CR) 10jerkins-bot: [V: 04-1] P:rsyslog: add squid to the list of programs sent to central log [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [15:59:52] !log Shutting down CI for maintenance on contint2001 # T283582 [15:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:56] T283582: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 [16:00:04] papaul and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) CI server restart deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1600). 
[16:01:02] (03CR) 10Hnowlan: [C: 03+2] partman: use reuse profiles on all restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/753986 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [16:02:35] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) @Papaul the machine is shutting down. I am on IRC if you want t... [16:03:07] !log installing xen security updates on buster (client-side libraries) [16:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] PROBLEM - Host contint2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:35] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) > From: @Joe > I'm generally not a big fan of reformatting patches, because of how hard they make to reconstruct git history. However,... [16:07:00] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: introduce cookbook to verify basic grid health [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) [16:07:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2010.codfw.wmnet with OS buster [16:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[16:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:44] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2010.codfw.wmnet with OS buster [16:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:42] (03PS1) 10Elukey: helmfile.d: add 'cert-manager' namespace to ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/754981 (https://phabricator.wikimedia.org/T298976) [16:13:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2010.codfw.wmnet with OS buster [16:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:37] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2010.codfw.wmnet with OS buster [16:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:34] (03PS6) 10DCausse: blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 [16:21:49] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2010.codfw.wmnet with OS buster [16:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:35] (03CR) 10Herron: [C: 03+1] hieradata: use / as miscweb health check [puppet] - 10https://gerrit.wikimedia.org/r/754881 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:23:51] (03CR) 10Herron: [C: 03+1] Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754879 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [16:23:59] (03CR) 10Herron: [C: 03+1] wmnet: move reads to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/754874 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [16:24:07] (03CR) 10Herron: [C: 03+1] wmnet: move writes to graphite1004 [dns] - 
10https://gerrit.wikimedia.org/r/754875 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [16:24:15] (03CR) 10Herron: [C: 03+1] Revert "graphite: check graphite2003 metrics" [puppet] - 10https://gerrit.wikimedia.org/r/754876 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [16:24:31] (03CR) 10Herron: [C: 03+1] Revert "profile: move statsd writes to graphite2003" [puppet] - 10https://gerrit.wikimedia.org/r/754877 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [16:32:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [16:37:06] 10ops-codfw: Possible cable issue on restbase2010 management interface - https://phabricator.wikimedia.org/T299426 (10hnowlan) [16:41:22] (03CR) 10Herron: "This would work to send squid syslog messages to the kafka logging (logstash) pipeline, but wouldn't affect centrallog as the description " [puppet] - 10https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [16:42:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [16:43:00] 10ops-codfw, 10Lift-Wing: ml-serve2001 logged a corrected memory error - https://phabricator.wikimedia.org/T299427 (10klausman) [16:43:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7625612, @jbond wrote: > Thanks for the work on this looks really good, in relation to linting vs automatic formatting i... 
[16:44:02] jouncebot: nowandnext [16:44:02] For the next 0 hour(s) and 15 minute(s): CI server restart (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1600) [16:44:02] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1700) [16:45:47] !log klausman@cumin2001 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [16:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:52] RECOVERY - Host contint2001 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [16:45:55] 10SRE, 10ops-codfw, 10Lift-Wing: ml-serve2001 logged a corrected memory error - https://phabricator.wikimedia.org/T299427 (10ops-monitoring-bot) Host rebooted by klausman@cumin2001 with reason: Reboot to clear ECC state in dmesg [16:46:53] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) reset IDRAC, upgrade BIOS and IDRAC. [16:47:07] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7626447, @fgiunchedi wrote: > 100% agreed on consistency, I like the general idea and wanted to say +1 on not removing bl... 
[16:47:13] (03CR) 10SBassett: [C: 04-1] doc.wikimedia.org CSP: Allow XHR requests to Wikipedia and Wikidata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [16:47:17] (03PS1) 10Btullis: Deploy the dev version of cassandra to aqs1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) [16:47:32] (03CR) 10jerkins-bot: [V: 04-1] Deploy the dev version of cassandra to aqs1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [16:47:43] !log hnowlan@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2010.codfw.wmnet with OS buster [16:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2010.codfw.wmnet with OS buster [16:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:20] PROBLEM - Check whether ferm is active by checking the default input chain on contint2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:48:25] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [16:48:26] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:40] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - 
https://alerts.wikimedia.org [16:49:28] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2010.codfw.wmnet with OS buster [16:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:52] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Key matches the one Nick sent me over WMF Slack." [puppet] - 10https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) (owner: 10Jelto) [16:51:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (10cmooney) Nick has sent me his key over Slack (responding to query from last week). I can confirm it matches the one in the Gerrit patch. [16:51:41] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33304/console" [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [16:52:20] !log contint2001: restarted ferm service [16:52:21] (03CR) 10JHathaway: Hieradata: format yaml with vinyl (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/754114 (https://phabricator.wikimedia.org/T236954) (owner: 10JHathaway) [16:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:56] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:01] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2010.codfw.wmnet with OS buster [16:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:14] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [16:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:34] 10SRE, 
10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) 05Open→03Resolved a:03Papaul I have restarted ferm. Zuul... [16:56:58] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) @hashar no problem you can close the task once all is back onli... [16:57:17] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/754981 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [16:58:34] 10SRE, 10ops-codfw, 10Lift-Wing: ml-serve2001 logged a corrected memory error - https://phabricator.wikimedia.org/T299427 (10klausman) `root@ml-serve2001:/sys/devices/system/edac/mc# grep . mc*/*count mc0/ce_count:0 mc0/ce_noinfo_count:0 mc0/ue_count:0 mc0/ue_noinfo_count:0 mc1/ce_count:0 mc1/ce_noinfo_coun... [17:00:04] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:03:09] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) [17:07:21] 10SRE, 10ops-codfw: Possible cable issue on restbase2010 management interface - https://phabricator.wikimedia.org/T299426 (10Papaul) @hnowlan looks like an IDRAC reset and firmware upgrade too on this server will fix the issue PE R430 purchased in 2016 [17:08:22] (03CR) 10Klausman: [C: 03+1] helmfile.d: add 'cert-manager' namespace to ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/754981 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [17:09:06] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [17:09:45] 10SRE, 10ops-codfw, 10Lift-Wing: ml-serve2001 logged a corrected memory error - https://phabricator.wikimedia.org/T299427 (10Papaul) confirmed all green in IDRAC [17:12:46] hashar: hi, is CI supposed to be fine at this point? 
[17:14:01] (03PS2) 10Andrew Bogott: cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - 10https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405) [17:14:04] (03PS1) 10Andrew Bogott: wmcs nfsclient: remove a long-absented mount [puppet] - 10https://gerrit.wikimedia.org/r/754991 [17:15:45] (03PS1) 10Ppchelko: First pass on creating config-schema.yaml [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754910 [17:16:14] (03PS1) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [17:16:28] (03PS1) 10Joal: Reset druid load jobs for network_flows_internal [puppet] - 10https://gerrit.wikimedia.org/r/754994 (https://phabricator.wikimedia.org/T263277) [17:16:33] (03PS2) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [17:16:37] !log installing gmp security updates [17:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:38] RECOVERY - Check whether ferm is active by checking the default input chain on contint2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:23:59] (03PS3) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [17:25:37] (03PS1) 10Cwhite: logstash: gitlab: rename service field prior to populating object [puppet] - 10https://gerrit.wikimedia.org/r/754995 [17:26:02] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: introduce cookbook to verify basic grid health [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) [17:32:35] urbanecm: according to hashar’s email, “the maintenance is complete”, but you’re not the only one having issues :/ [17:32:48] i can't get jenkins to run anything [17:32:55] maybe something is still broken 
[17:32:56] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:32:57] and https://integration.wikimedia.org/zuul/ is empty [17:33:03] yeah I think so [17:33:14] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6982 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:33:16] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I dislike this approach. We need better code/config decoupling." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/754944 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [17:33:33] I swear I have seen changes being tested and jobs being triggered [17:34:06] A couple of jobs processed but then everything stopped. [17:34:14] hmm [17:34:31] indeed the zuul scheduler is idle [17:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:37:38] !log restarted zuul on contint2001 [17:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:50] Jan 18 17:37:32 contint2001 zuul-server[20572]: /srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/client.py:685: UserWarning: Unknown ssh-rsa host key for [gerrit.wikimedia.org]:29418: dce9687b991b27d0f9fdce6a2ebf92e1 [17:37:52] ... [17:38:06] oh joy [17:38:23] ew. [17:38:42] I have no idea how Gerrit ssh host key might have changed [17:38:58] or the zuul user no more knows about it [17:39:14] if it really did change then I should get a complaint when I try to clone a repo. 
[17:39:15] [17:39:16] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:39:34] No complaint received. [17:40:21] not seeing a changed gerrit host key here either [17:40:24] the scheduler runs as zuul the list of known hosts is in /var/lib/zuul/.ssh/known_hosts last touched in July 2020 [17:40:39] maybe that is false warning ;) [17:40:58] zuul-serv 20572 zuul 17u IPv6 222452 0t0 TCP contint2001.wikimedia.org:54516->gerrit.wikimedia.org:29418 (ESTABLISHED) [17:41:33] (03CR) 10Dzahn: "I can also add the /healthz file. I just did not care so far because I knew it only checks if the port is open. hmm" [puppet] - 10https://gerrit.wikimedia.org/r/754881 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [17:44:00] *sigh* [17:44:01] hashar: It looks happier now. [17:44:16] maybe it failed to connect to gerrit [17:44:26] or the connection dropped when I restarted ferm [17:44:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:45:22] 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10RobH) [17:45:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10RobH) [17:46:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10RobH) a:03Papaul [17:46:08] so I think the sequence is zuul restarted all fine, connected to gerrit and processed changes [17:46:40] I then restarted ferm (firewall stuff) at 16:51:50 UTC [17:47:01] which might 
have killed the zuul ---(ssh:29418)---> gerrit connection [17:47:04] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7102 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:47:06] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6974 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:47:13] (03CR) 10Eevans: [C: 03+1] "LGTM; When are you planning to push this out?" [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [17:47:32] 10SRE, 10Observability-Metrics, 10observability, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10lmata) [17:47:51] 10SRE, 10Observability-Metrics, 10WMF-Legal, 10observability, and 2 others: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (10lmata) [17:48:01] and possible flushed all the gearman function which RhinosF1 noticed at some point (job results reported as NOT_REGISTERED) [17:52:27] dancy: yeah looks better thx! [17:52:30] Lucas_WMDE: should be good now [17:52:38] yup, thanks! [17:54:18] hashar: I assume the connection from zuul to gerrit is to receive the events stream? [17:54:36] correct [17:55:00] then if the firewall killed the connection I would expect zuul to notice that and attempt to reconnect [17:55:55] if zuul never sends anything down the connection (other than TCP ACKs), it'll never find out [17:56:01] it'll just think there are no events. [17:56:17] TCP keepalives might help recognize the broken connection sooner. [17:56:53] That said, I'm surprised that connection tracking didn't keep the traffic flowing. 
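[Editor's note] The keepalive point raised at 17:56:17 can be illustrated with a minimal Python sketch. It is not taken from Zuul, Gerrit, or the ferm configuration; the helper name enable_keepalive and the idle/interval/probes values are assumptions chosen for illustration. It shows how a client that mostly listens on a long-lived connection could ask the kernel to probe the peer, so that a silently dropped connection eventually surfaces as an error instead of looking like "no events".

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive probes for a mostly idle connection.

    The idle/interval/probes defaults are illustrative only; they are
    not values from any configuration mentioned in this log.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform specific (available on
    # Linux), so each one is applied only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the kernel gives up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

With settings like these, a reader blocked on the socket would get an error after roughly idle + interval * probes seconds of a dead path, at which point the application could reconnect. For an SSH stream specifically, paramiko's Transport.set_keepalive(interval) offers a similar mechanism at the SSH layer; whether either approach applies to the setup discussed here is not established in this log.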
[17:57:06] !log restarting blazegraph on wdqs1007 (jvm stuck for 13hours) [17:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:15] But I haven't looked at the ferm rules closely to see how they're arranged. [17:57:17] maybe it is something entirely different [17:57:57] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:59:07] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:00:04] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1800). [18:01:36] (03CR) 10Ebernhardson: [C: 03+1] blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [18:03:11] 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10RobH) [18:03:46] 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10RobH) [18:04:09] (03CR) 10jerkins-bot: [V: 04-1] blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [18:04:15] 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10RobH) a:03Jclark-ctr [18:04:29] looks like it is fine, away idling again [18:04:33] I am away idling again [18:04:40] (03CR) 10Accraze: [C: 03+1] helmfile.d: add 'cert-manager' namespace to ml-serve [deployment-charts] - 
10https://gerrit.wikimedia.org/r/754981 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [18:04:44] Thanks hashar! [18:05:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) (owner: 10Jelto) [18:05:13] (03CR) 10Ppchelko: "recheck" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754910 (owner: 10Ppchelko) [18:05:23] (03CR) 10Ppchelko: "recheck" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [18:06:08] (03PS1) 10Majavah: Drop CentralAuthUserMerge log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754998 (https://phabricator.wikimedia.org/T216089) [18:07:21] (03PS2) 10Majavah: Drop CentralAuthUserMerge log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754998 (https://phabricator.wikimedia.org/T216089) [18:09:07] (03PS1) 10Majavah: Disable UserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754999 (https://phabricator.wikimedia.org/T216089) [18:10:05] jouncebot: nowandnext [18:10:05] For the next 0 hour(s) and 49 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1800) [18:10:05] In 0 hour(s) and 49 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1900) [18:10:11] (03PS2) 10Urbanecm: pwnwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754504 (https://phabricator.wikimedia.org/T298115) [18:10:21] (03CR) 10Urbanecm: [C: 03+2] pwnwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754504 (https://phabricator.wikimedia.org/T298115) (owner: 10Urbanecm) [18:11:35] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6981 MB (19% inode=99%): 
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:11:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul) [18:13:46] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 [18:13:51] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 (owner: 10Jeena Huneidi) [18:14:26] (03CR) 10jerkins-bot: [V: 04-1] testwikis wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 (owner: 10Jeena Huneidi) [18:14:49] jeena: I'm sorry, i thought train is in 2 hours and +2'ed a config change of myself [18:15:00] happy to wait though [18:15:23] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7337 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:15:50] hashar: CI is still broke [18:16:00] (03Merged) 10jenkins-bot: pwnwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754504 (https://phabricator.wikimedia.org/T298115) (owner: 10Urbanecm) [18:16:06] damn [18:16:07] not fully at least [18:16:10] I'm getting errors on sonar and freshnel builds [18:16:27] links? 
[18:16:29] but the config patch took more than i'd want it to, and the failure at jeena's patch is weird too [18:16:34] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/754909/7 [18:16:43] hashar: pretty sure sonar / freshnel isn't me [18:16:43] https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-lint-docker/19990/console [18:16:53] jeena: I will clean the agent disks [18:16:57] "No space left on device" [18:16:59] (03CR) 10jerkins-bot: [V: 04-1] Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [18:17:16] pruning [18:17:30] I looked last friday at rebuilding the fleet of agents to have more disk space [18:17:32] * urbanecm goes to sync his patch that got merged [18:17:40] thanks hashar. I canceled the merge anyway [18:17:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) [18:18:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:18:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:37] (03CR) 10Jforrester: [C: 03+1] Disable UserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754999 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [18:20:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0ff5874469b717cba38ed7cff0669754517a3553: pwnwiki: Deploy Growth features to newcomers (T298115) (duration: 02m 14s) 
[18:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:07] T298115: Deploy Growth features at pwn.wikipedia.org - https://phabricator.wikimedia.org/T298115 [18:20:08] * urbanecm done with deployment [18:20:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:11] jeena: ^˘ [18:20:22] (03PS2) 10Ppchelko: First pass on creating config-schema.yaml [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754910 [18:20:42] (03PS4) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [18:20:57] RhinosF1: jeena I have cleaned the CI agents [18:21:17] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 (owner: 10Jeena Huneidi) [18:21:45] hashar: let's see how many errors I cause this time [18:21:47] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7095 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:22:45] urbanecm: sorry I didn't see your earlier messages! 
[18:23:04] i saw you cancelled your merge, so i just finished what i wanted to do [18:23:46] I should have checked before trying to deploy to testwikis [18:23:55] train is on pause atm now though [18:24:08] hashar: only me caused errors at the moment [18:25:07] but yeah the train window is indeed at the time you thought, we just usually deploy to testwikis ahead of that [18:25:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:41] (03CR) 10Ebernhardson: sre.wdqs.data-reload: few fixes and cleanups (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse) [18:26:12] jeena: i see, didn't know that. Thanks for explaining. Anyway, I'm done now :). [18:26:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:26:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:26:27] RECOVERY - Cassandra instance data free space on restbase2011 is OK: DISK OK - free space: /srv/cassandra/instance-data 11433 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:56] Thanks! :) [18:27:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:10] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10colewhite) Thanks for looking into this! Automatic formatting would be great as long as the output is human-oriented. >>! In T236954#7624944, @jh... 
[18:31:11] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7360 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:32:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [18:35:57] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7068 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:38:10] (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/754995 (owner: 10Cwhite) [18:40:41] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7147 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:41:27] (03CR) 10jerkins-bot: [V: 04-1] logstash: gitlab: rename service field prior to populating object [puppet] - 10https://gerrit.wikimedia.org/r/754995 (owner: 10Cwhite) [18:42:29] (03CR) 10jerkins-bot: [V: 04-1] Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [18:42:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [18:44:43] 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10RobH) [18:45:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - 
https://phabricator.wikimedia.org/T299437 (10RobH) [18:46:41] (03CR) 10SBassett: [C: 03+1] doc.wikimedia.org CSP: Allow XHR requests to Wikipedia and Wikidata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [18:46:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10RobH) a:03Papaul [18:52:29] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7290 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:52:51] (03CR) 10Cwhite: "CI seems to be complaining about something different." [puppet] - 10https://gerrit.wikimedia.org/r/754995 (owner: 10Cwhite) [18:53:09] (03PS1) 10Jgiannelos: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/755005 [18:58:18] (03CR) 10Jgiannelos: [C: 03+2] proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/755005 (owner: 10Jgiannelos) [18:59:33] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7331 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:59:33] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6897 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T1900) [19:02:10] (03Merged) 10jenkins-bot: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/755005 (owner: 10Jgiannelos) [19:05:21] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.38.0-wmf.18 refs 
T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 (owner: 10Jeena Huneidi) [19:05:44] (03CR) 1020after4: testwikis wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 (owner: 10Jeena Huneidi) [19:06:29] whoops didn't notice jeena had removed +2 [19:06:54] twentyafterfour: yes, paused the deployment due to an UBN [19:08:09] apologies, I won't deploy anything [19:08:25] (03PS5) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [19:08:29] np, thanks though! [19:09:01] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6818 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:11:25] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7328 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:14:46] (03PS1) 10Ottomata: Prep for releasing ~wmf6 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/755008 [19:15:07] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Prep for releasing ~wmf6 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/755008 (owner: 10Ottomata) [19:16:07] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7334 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:20:51] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7304 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:21:46] (03CR) 10Ebernhardson: [C: 03+1] wcqs: set QUERY_SERVICE env name with wcqs/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/753973 (owner: 10DCausse) [19:22:04] (03PS6) 10Ppchelko: 
Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [19:23:15] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7250 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:25:35] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7362 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:30:58] (03CR) 10Herron: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/754995 (owner: 10Cwhite) [19:32:09] PROBLEM - Device not healthy -SMART- on restbase2010 is CRITICAL: cluster=restbase device={sde,sdf,sdg,sdh,sdi,sdj} instance=restbase2010 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase2010&var-datasource=codfw+prometheus/ops [19:32:11] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7142 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:32:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [19:35:14] (03PS1) 10Jdlrobson: SkinTemplate: Set template context in buildPersonalUrls() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754912 (https://phabricator.wikimedia.org/T299352) [19:35:30] (03CR) 10Herron: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/754995 (owner: 10Cwhite) [19:36:39] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7163 MB (20% inode=99%): 
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:37:51] (03PS1) 10Jdlrobson: Don't run Vector hook when menu absent from page [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754913 (https://phabricator.wikimedia.org/T289619) [19:38:56] (03PS1) 10Jdlrobson: Restore icons to user links dropdown [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755012 (https://phabricator.wikimedia.org/T289619) [19:42:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [19:43:35] (03CR) 10jerkins-bot: [V: 04-1] Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [19:48:31] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7069 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:52:14] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) [19:52:50] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) CI had to be restarted after the machine went up due to some od... [19:52:56] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10serviceops-radar, 10Release-Engineering-Team (Radar): contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10hashar) 05Stalled→03Resolved a:03jbond The DRAC on contint2001.wikimedia.org has been upgraded... 
[19:54:47] (03PS7) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [19:54:59] (03PS8) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [19:55:35] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6792 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:58:19] (03CR) 10Jeena Huneidi: [C: 03+2] SkinTemplate: Set template context in buildPersonalUrls() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754912 (https://phabricator.wikimedia.org/T299352) (owner: 10Jdlrobson) [19:59:20] (03CR) 10Jdlrobson: [C: 03+1] Don't run Vector hook when menu absent from page [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754913 (https://phabricator.wikimedia.org/T289619) (owner: 10Jdlrobson) [19:59:24] (03CR) 10Jdlrobson: [C: 03+1] Restore icons to user links dropdown [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755012 (https://phabricator.wikimedia.org/T289619) (owner: 10Jdlrobson) [20:00:05] jeena and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220118T2000). 
[20:01:26] deploy will commence after merging some backports to the wmf.18 branch [20:01:49] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup s7 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754025 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [20:04:02] (03CR) 10Cwhite: [V: 03+2 C: 03+2] logstash: gitlab: rename service field prior to populating object [puppet] - 10https://gerrit.wikimedia.org/r/754995 (owner: 10Cwhite) [20:04:22] (03PS2) 10Cwhite: logstash: gitlab: rename service field prior to populating object [puppet] - 10https://gerrit.wikimedia.org/r/754995 [20:04:28] (03CR) 10Cwhite: [V: 03+2 C: 03+2] logstash: gitlab: rename service field prior to populating object [puppet] - 10https://gerrit.wikimedia.org/r/754995 (owner: 10Cwhite) [20:12:13] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6962 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:16:14] (03CR) 10Herron: [C: 03+2] remove references to centrallog2001 [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [20:16:55] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11801 MB (33% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:17:30] (03Merged) 10jenkins-bot: remove references to centrallog2001 [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [20:18:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [20:19:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [20:19:38] (03Merged) 10jenkins-bot: SkinTemplate: 
Set template context in buildPersonalUrls() [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754912 (https://phabricator.wikimedia.org/T299352) (owner: 10Jdlrobson) [20:20:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) p:05Medium→03High Setting to high priority as this is the test bed order for the PERC H750 controller blocking other orders via T297913. Getting at... [20:20:46] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755012 (https://phabricator.wikimedia.org/T289619) (owner: 10Jdlrobson) [20:21:31] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6652 MB (18% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:21:39] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754913 (https://phabricator.wikimedia.org/T289619) (owner: 10Jdlrobson) [20:23:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:38] (03CR) 10Daniel Kinzler: "Looks good to me! I'd like to hear from Timo though." 
[core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [20:24:56] (03CR) 10Herron: "hmm, seeing "Unable to connect" errors while trying to apply this via homer https://phabricator.wikimedia.org/P18785" [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [20:24:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:24:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:25:59] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6758 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:10] (03CR) 10Daniel Kinzler: [C: 03+1] "Good to go as an experiment. The new file isn't used anywhere (except for the experimental follow up patch). So this should be safe." 
[core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754910 (owner: 10Ppchelko) [20:29:29] (03PS1) 104nn1l2: Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754914 [20:30:23] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7143 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:32:37] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7334 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:38:24] (03PS9) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [20:38:32] (03CR) 10Ppchelko: Benchmark loading DefaultSettings from YAML (032 comments) [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [20:38:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [20:38:56] (03Merged) 10jenkins-bot: Don't run Vector hook when menu absent from page [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754913 (https://phabricator.wikimedia.org/T289619) (owner: 10Jdlrobson) [20:39:17] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7183 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:39:57] (03Merged) 10jenkins-bot: Restore icons to user links dropdown [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755012 (https://phabricator.wikimedia.org/T289619) (owner: 10Jdlrobson) [20:40:15] PROBLEM - Widespread puppet agent failures 
on alert1001 is CRITICAL: 0.0103 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:40:51] (03PS14) 10Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [20:41:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:42:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:48] (03CR) 10jerkins-bot: [V: 04-1] gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [20:43:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:05] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7199 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:47:14] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 (owner: 10Jeena Huneidi) [20:48:00] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755001 (owner: 10Jeena Huneidi) [20:48:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [20:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:29] (03PS1) 104nn1l2: fawiki: Exempt userspaces from being indexed by search engines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755018 (https://phabricator.wikimedia.org/T299363) [20:48:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [20:49:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:49:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:54] !log jhuneidi@deploy1002 Started scap: testwikis to 1.38.0-wmf.18 refs T293959 [20:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:58] T293959: 1.38.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T293959 [20:55:31] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7016 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:55:31] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7111 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:57:08] (03CR) 10Krinkle: Benchmark loading DefaultSettings from YAML (034 comments) 
[core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [20:57:13] (03PS1) 10Ottomata: Use conda-environment.yaml for repeatable env builds [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/755021 [20:57:59] (03PS2) 10Ottomata: Use conda-environment.yaml for repeatable env builds [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/755021 [20:59:30] (03PS3) 10Ottomata: Use conda-environment.yaml for repeatable env builds [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/755021 [20:59:58] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use conda-environment.yaml for repeatable env builds [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/755021 (owner: 10Ottomata) [21:00:01] PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 10882 MB (4% inode=63%): /tmp 10882 MB (4% inode=63%): /var/tmp 10882 MB (4% inode=63%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops [21:04:57] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7121 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:05:59] (03PS15) 10Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [21:07:49] (03CR) 10jerkins-bot: [V: 04-1] gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [21:14:01] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: miscweb1002, build2001, wdqs1010, labstore1006, labstore1007 
https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:14:27] PROBLEM - Cassandra instance data free space on restbase2011 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7235 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:23:28] (03PS10) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [21:23:47] (03CR) 10Ppchelko: Benchmark loading DefaultSettings from YAML (034 comments) [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [21:24:28] (03PS11) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [21:26:21] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7074 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:29:26] !log jhuneidi@deploy1002 Finished scap: testwikis to 1.38.0-wmf.18 refs T293959 (duration: 38m 31s) [21:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:29] T293959: 1.38.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T293959 [21:30:49] (03PS1) 10Hashar: Merge tag 'v3.3.9' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/755024 (https://phabricator.wikimedia.org/T240264) [21:34:17] (03CR) 10jerkins-bot: [V: 04-1] Merge tag 'v3.3.9' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/755024 (https://phabricator.wikimedia.org/T240264) (owner: 10Hashar) [21:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [21:35:27] (03PS1) 104nn1l2: azwiki: change alias Q to 
QA for the draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755026 (https://phabricator.wikimedia.org/T299332) [21:35:54] (03PS2) 10Hashar: Merge tag 'v3.3.9' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/755024 (https://phabricator.wikimedia.org/T240264) [21:37:41] deploying to group0 shortly [21:37:47] (03CR) 10jerkins-bot: [V: 04-1] Merge tag 'v3.3.9' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/755024 (https://phabricator.wikimedia.org/T240264) (owner: 10Hashar) [21:39:32] (03PS1) 10Jeena Huneidi: group0 wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755027 [21:39:38] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755027 (owner: 10Jeena Huneidi) [21:40:05] (03PS2) 104nn1l2: azwiki: Change alias Q to QA for the draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755026 (https://phabricator.wikimedia.org/T299332) [21:40:32] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755027 (owner: 10Jeena Huneidi) [21:41:08] (03PS1) 10Hashar: Update Gerrit to 3.3.9 [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/755028 (https://phabricator.wikimedia.org/T299451) [21:42:26] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.18 refs T293959 [21:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:29] (03CR) 10jerkins-bot: [V: 04-1] Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [21:42:30] T293959: 1.38.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T293959 [21:44:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors 
- https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [21:46:40] (03PS3) 10Hashar: Merge tag 'v3.3.9' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/755024 (https://phabricator.wikimedia.org/T240264) [21:53:48] (03PS2) 10Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - 10https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805) [21:55:05] (03PS2) 10Ryan Kemper: elasticsearch: hiera for new eqiad nodes (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805) [21:55:07] (03PS2) 10Ryan Kemper: elasticsearch: activate role (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/736117 (https://phabricator.wikimedia.org/T294805) [21:55:09] (03PS2) 10Ryan Kemper: elasticsearch: new master config (step 3) [puppet] - 10https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805) [21:55:11] (03PS3) 10Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - 10https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805) [21:57:42] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10RobH) [21:57:48] (03PS4) 10Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - 10https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805) [21:57:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10RobH) [21:58:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10User-ema: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10Krinkle) [21:59:36] (03CR) 10Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:59:57] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (10Krinkle) [22:07:22] (03CR) 10Brennen Bearnes: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [22:07:26] (03PS12) 10Ppchelko: Benchmark loading DefaultSettings from YAML [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 [22:08:54] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10RobH) [22:09:27] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10RobH) [22:10:11] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10RobH) a:03Jclark-ctr [22:12:16] (03PS2) 10Hashar: Update Gerrit to 3.3.9 + plugins [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/755028 (https://phabricator.wikimedia.org/T240264) [22:20:45] (03CR) 10Ryan Kemper: elasticsearch: activate role (step 2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736117 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [22:27:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [22:29:51] (03PS1) 10Cwhite: bump patch version to update plugins [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/755033 [22:31:19] (03CR) 10Ppchelko: [C: 04-2] "After discussing with Tim, this should go into mediawiki-config repo." 
[core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754911 (owner: 10Ppchelko) [22:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [22:36:03] (03PS1) 10Bartosz Dziewoński: Enable wikis to customize the syntax used for replies [extensions/DiscussionTools] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754915 (https://phabricator.wikimedia.org/T259864) [22:36:05] (03CR) 10Cwhite: [C: 03+2] logstash: add optional document_type parameter to es output config [puppet] - 10https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [22:36:23] (03PS1) 10Bartosz Dziewoński: Ensure the marker appears in a reasonable place when replying with a bullet [extensions/DiscussionTools] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754916 (https://phabricator.wikimedia.org/T259864) [22:37:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [22:37:23] (03CR) 10Cwhite: [C: 03+2] logstash: add optional document_type parameter to es output config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [22:37:49] (03PS3) 10Bartosz Dziewoński: DiscussionTools: Use bullet indentation on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753192 (https://phabricator.wikimedia.org/T259864) [22:40:41] (03PS5) 10Cwhite: logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) [22:43:06] (03PS6) 10Cwhite: logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) [22:43:49] 
(03PS7) 10Cwhite: logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) [22:44:25] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7324 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:44:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [22:45:38] (03CR) 10Cwhite: [C: 03+2] logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [22:46:29] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, labstore1007, wdqs1010, labstore1006, miscweb1002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:50:07] Hey all - I'd like to deploy a security patch for T298434 to wmf.18 and wmf.17 now. Let me know if I shouldn't... 
[22:51:31] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7347 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:52:49] (03PS1) 10Clare Ming: Update config for pilot wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) [22:56:17] RECOVERY - Cassandra instance data free space on restbase2011 is OK: DISK OK - free space: /srv/cassandra/instance-data 12032 MB (34% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:57:20] !log Deployed security patch for T298434 to 1.38.0-wmf.17 [22:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:38] !log Deployed security patch for T298434 to 1.38.0-wmf.18 [22:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:05] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7061 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [23:05:12] (03PS1) 10Zabe: Don't use array keys for OOUI [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754917 (https://phabricator.wikimedia.org/T299463) [23:07:55] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:08:03] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection 
error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:10:05] I need to rebalance db2101, it gets too loaded at peak backup time [23:14:21] SRE, ops-eqiad, DC-Ops, Data-Engineering: (Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (RobH) [23:14:48] SRE, ops-eqiad, DC-Ops, Data-Engineering: (Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (RobH) [23:17:48] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) [23:20:10] (PS1) Cwhite: beta-logs: use opensearch output plugin [puppet] - https://gerrit.wikimedia.org/r/755040 (https://phabricator.wikimedia.org/T299168) [23:21:59] (CR) jerkins-bot: [V: -1] beta-logs: use opensearch output plugin [puppet] - https://gerrit.wikimedia.org/r/755040 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite) [23:22:05] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:23:12] (CR) Cwhite: [V: +2 C: +2] beta-logs: use opensearch output plugin [puppet] - https://gerrit.wikimedia.org/r/755040 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite) [23:23:31] PROBLEM - MariaDB Replica Lag: s2 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 449.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:24:27] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:24:31] RECOVERY - MariaDB 
Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:24:39] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:24:45] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7401 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [23:25:51] RECOVERY - MariaDB Replica Lag: s2 on db2101 is OK: OK slave_sql_lag Replication lag: 1.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:29:29] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 12466 MB (35% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [23:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [23:34:03] (CR) Clare Ming: "I have a question out to Olga confirming that the language alert in the sidebar should be enabled for pilot wikis." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) (owner: Clare Ming) [23:34:50] (PS1) Cwhite: prepare for logstash 7.16.3 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/755041 (https://phabricator.wikimedia.org/T299168) [23:36:14] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (RobH) [23:36:54] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (RobH) [23:39:04] (PS1) Zabe: Don't use array keys for OOUI in AbuseFilterViewDiff [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/754918 (https://phabricator.wikimedia.org/T299463) [23:39:15] (PS1) Cwhite: builder: add opensearch1 pbuilder hooks for logstash-plugins update [puppet] - https://gerrit.wikimedia.org/r/755043 (https://phabricator.wikimedia.org/T299168) [23:41:45] (PS2) 4nn1l2: Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist" [mediawiki-config] - https://gerrit.wikimedia.org/r/754914 [23:43:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [23:57:01] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (RobH) [23:58:02] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (RobH)