[00:01:46] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[00:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:02:40] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 235 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:04:38] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 29 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:07:04] PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:07:40] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:07:40] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:07:54] RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:17:12] (PS1) Stang: Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election" [mediawiki-config] - https://gerrit.wikimedia.org/r/850577
[01:19:20] PROBLEM - MediaWiki exceptions and fatals
per minute for api_appserver on alert1001 is CRITICAL: 188 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:31:10] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:38:45] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:45] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[02:05:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[02:08:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:58] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:27:18] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:09:32] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:16] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:16] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell -
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:29:08] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:28:05] (PS2) KartikMistry: Update cxserver to 2022-10-27-102021-production [deployment-charts] - https://gerrit.wikimedia.org/r/850315 (https://phabricator.wikimedia.org/T225494)
[05:36:48] (CR) KartikMistry: [C: +2] Update cxserver to 2022-10-27-102021-production [deployment-charts] - https://gerrit.wikimedia.org/r/850315 (https://phabricator.wikimedia.org/T225494) (owner: KartikMistry)
[05:40:12] (Merged) jenkins-bot: Update cxserver to 2022-10-27-102021-production [deployment-charts] - https://gerrit.wikimedia.org/r/850315 (https://phabricator.wikimedia.org/T225494) (owner: KartikMistry)
[05:41:50] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:42:19] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:42:22] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:47:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:47:36] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:48:34] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:49:02] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:51:18] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:52:26] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:52:36] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) is CRITICAL: Test Suggest a target title for the given source title and language pairs returned the unexpected status 403 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[05:54:47] Hmm. This shouldn't happen..
[05:57:38] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) is CRITICAL: Test Suggest a target title for the given source title and language pairs returned the unexpected status 403 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[06:00:40] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[06:01:18] (PS1) KartikMistry: Revert "Update cxserver to 2022-10-27-102021-production" [deployment-charts] - https://gerrit.wikimedia.org/r/850578
[06:08:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[06:08:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[06:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on
db2117.codfw.wmnet with reason: Maintenance
[06:10:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[06:10:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T321123)', diff saved to https://phabricator.wikimedia.org/P37043 and previous config saved to /var/cache/conftool/dbconfig/20221031-061026-marostegui.json
[06:10:33] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[06:11:07] (CR) KartikMistry: [C: +2] Revert "Update cxserver to 2022-10-27-102021-production" [deployment-charts] - https://gerrit.wikimedia.org/r/850578 (owner: KartikMistry)
[06:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321123)', diff saved to https://phabricator.wikimedia.org/P37044 and previous config saved to /var/cache/conftool/dbconfig/20221031-061236-marostegui.json
[06:14:56] (Merged) jenkins-bot: Revert "Update cxserver to 2022-10-27-102021-production" [deployment-charts] - https://gerrit.wikimedia.org/r/850578 (owner: KartikMistry)
[06:15:55] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:16:15] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:16:42] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:17:19] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:17:48] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:18:02] PROBLEM - SSH on db1123.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:18:14] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:18:16] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All
endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[06:19:20] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[06:19:42] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37045 and previous config saved to /var/cache/conftool/dbconfig/20221031-062743-marostegui.json
[06:42:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37046 and previous config saved to /var/cache/conftool/dbconfig/20221031-064249-marostegui.json
[06:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:57:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321123)', diff saved to https://phabricator.wikimedia.org/P37047 and previous config saved to /var/cache/conftool/dbconfig/20221031-065756-marostegui.json
[06:57:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[06:58:03] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[06:58:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[06:58:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T321123)', diff saved to https://phabricator.wikimedia.org/P37048 and previous config saved to /var/cache/conftool/dbconfig/20221031-065817-marostegui.json
[06:59:13]
(KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:59:44] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 143 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:00:05] Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221031T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:22] o/
[07:00:25] i can deploy today
[07:00:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321123)', diff saved to https://phabricator.wikimedia.org/P37049 and previous config saved to /var/cache/conftool/dbconfig/20221031-070029-marostegui.json
[07:00:34] kart_: hi
[07:01:44] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:01:50] (CR) ArielGlenn: [C: +1] "Thumbs up from me too."
[puppet] - https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:02:22] !log [WDQS] `ryankemper@wdqs1007:~$ sudo systemctl restart wdqs-blazegraph.service`
[07:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:43] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.143 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:03:52] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:04:28] urbanecm: Sorry, slightly late..
[07:04:44] kart_: no worries. do you want to self-serve, or should i deploy?
[07:05:14] Yeah, will do self-serve deployment. Starting in a minute..
[07:05:28] ack
[07:06:36] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/845573 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[07:07:22] (Merged) jenkins-bot: Enable Section Translation in Hawaiian, Pashto and Xhosa WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/845573 (https://phabricator.wikimedia.org/T317289) (owner: KartikMistry)
[07:07:41] !log kartik@deploy1002 Started scap: Backport for [[gerrit:845573|Enable Section Translation in Hawaiian, Pashto and Xhosa WPs (T317289)]]
[07:07:47] T317289: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T317289
[07:08:04] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:845573|Enable Section Translation in Hawaiian, Pashto and Xhosa WPs (T317289)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:12:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[07:13:11]
!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[07:13:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[07:14:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[07:14:30] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:845573|Enable Section Translation in Hawaiian, Pashto and Xhosa WPs (T317289)]] (duration: 06m 48s)
[07:14:35] T317289: Enable Content and Section translation on 6 Wikipedias - https://phabricator.wikimedia.org/T317289
[07:15:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37050 and previous config saved to /var/cache/conftool/dbconfig/20221031-071536-marostegui.json
[07:17:05] urbanecm: done :)
[07:18:44] yay! :)
[07:18:52] RECOVERY - SSH on db1123.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:29:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[07:30:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37051 and previous config saved to /var/cache/conftool/dbconfig/20221031-073042-marostegui.json
[07:32:14] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:33:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[07:33:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[07:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:37:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[07:45:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321123)', diff saved to https://phabricator.wikimedia.org/P37052 and previous config saved to /var/cache/conftool/dbconfig/20221031-074549-marostegui.json
[07:45:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:45:56] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[07:46:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:46:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T321123)', diff saved to https://phabricator.wikimedia.org/P37053 and previous config saved to /var/cache/conftool/dbconfig/20221031-074611-marostegui.json
[07:48:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321123)', diff saved to https://phabricator.wikimedia.org/P37054 and previous config saved to /var/cache/conftool/dbconfig/20221031-074823-marostegui.json
[07:52:59] (PS3) Matthias Mullie: Enable ImageSuggestions on ca, no, fi & huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064)
[08:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37055 and previous config
saved to /var/cache/conftool/dbconfig/20221031-080329-marostegui.json
[08:04:18] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:04:48] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:46] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:08] SRE, SRE-Access-Requests, Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF)
[08:09:18] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:10:32] SRE, SRE-Access-Requests, Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF) @thcipriani do you function as a WMF sponsor/manager as well in this case?
[08:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37056 and previous config saved to /var/cache/conftool/dbconfig/20221031-081836-marostegui.json
[08:33:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321123)', diff saved to https://phabricator.wikimedia.org/P37057 and previous config saved to /var/cache/conftool/dbconfig/20221031-083342-marostegui.json
[08:33:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:33:49] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[08:33:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:34:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[08:34:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[08:34:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:34:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:34:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T321123)', diff saved to https://phabricator.wikimedia.org/P37058 and previous config saved to /var/cache/conftool/dbconfig/20221031-083449-marostegui.json
[08:37:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321123)', diff saved to https://phabricator.wikimedia.org/P37059 and previous config saved to /var/cache/conftool/dbconfig/20221031-083701-marostegui.json
[08:40:26] !log ladsgroup@cumin1001
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[08:40:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[08:40:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[08:40:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[08:47:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:47:29] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:47:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:47:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:47:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37060 and previous config saved to /var/cache/conftool/dbconfig/20221031-084751-ladsgroup.json
[08:47:57] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[08:48:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:48:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:48:42] (PS2) Awight: Invite some of WMDE Tech Wishes team to poke around maps instances [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722)
[08:48:58] (CR) Awight: "PS 2: I've
downgraded our request to two -admins groups" [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight)
[08:50:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[08:50:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[08:52:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37061 and previous config saved to /var/cache/conftool/dbconfig/20221031-085208-marostegui.json
[08:52:26] (PS3) Awight: Invite some of WMDE Tech Wishes team to poke around maps instances [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722)
[08:57:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[08:57:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[08:58:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:58:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:58:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37062 and previous config saved to /var/cache/conftool/dbconfig/20221031-085839-ladsgroup.json
[08:58:45] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[08:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37063 and previous config saved to
/var/cache/conftool/dbconfig/20221031-085920-ladsgroup.json
[08:59:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[09:00:26] (PS1) Matthias Mullie: Update i18n for ca, nb, fi & hu [extensions/ImageSuggestions] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850985 (https://phabricator.wikimedia.org/T300064)
[09:06:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[09:06:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[09:06:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T318955)', diff saved to https://phabricator.wikimedia.org/P37064 and previous config saved to /var/cache/conftool/dbconfig/20221031-090640-ladsgroup.json
[09:06:46] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[09:07:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37065 and previous config saved to /var/cache/conftool/dbconfig/20221031-090714-marostegui.json
[09:11:56] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P37066 and previous config saved to /var/cache/conftool/dbconfig/20221031-091426-ladsgroup.json
[09:17:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T318955)', diff saved to https://phabricator.wikimedia.org/P37067 and previous config saved to /var/cache/conftool/dbconfig/20221031-091735-ladsgroup.json
[09:17:42] T318955: Drop fr_comment and fr_text
from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[09:17:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:17:58] !log set thanos ring replicas to 3.40 T311690
[09:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:03] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[09:21:17] (PS1) Filippo Giunchedi: hieradata: don't monitor /run/docker on alerting_host [puppet] - https://gerrit.wikimedia.org/r/850993 (https://phabricator.wikimedia.org/T313229)
[09:21:42] (CR) CI reject: [V: -1] hieradata: don't monitor /run/docker on alerting_host [puppet] - https://gerrit.wikimedia.org/r/850993 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[09:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321123)', diff saved to https://phabricator.wikimedia.org/P37068 and previous config saved to /var/cache/conftool/dbconfig/20221031-092221-marostegui.json
[09:22:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[09:22:27] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[09:22:36] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:22:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[09:22:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37069 and previous config saved to /var/cache/conftool/dbconfig/20221031-092242-marostegui.json
[09:22:57] (PS2)
10Filippo Giunchedi: hieradata: don't monitor /run/docker on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/850993 (https://phabricator.wikimedia.org/T313229) [09:23:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37070 and previous config saved to /var/cache/conftool/dbconfig/20221031-092354-marostegui.json [09:26:59] (03PS2) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) [09:29:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P37071 and previous config saved to /var/cache/conftool/dbconfig/20221031-092933-ladsgroup.json [09:29:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:30:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:30:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37072 and previous config saved to /var/cache/conftool/dbconfig/20221031-093012-ladsgroup.json [09:30:20] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:30:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [09:30:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [09:31:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37073 and previous config saved to 
/var/cache/conftool/dbconfig/20221031-093102-ladsgroup.json [09:32:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P37074 and previous config saved to /var/cache/conftool/dbconfig/20221031-093242-ladsgroup.json [09:34:02] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:36:35] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10SLyngshede-WMF) 05Open→03Resolved [09:36:37] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [09:37:20] 10SRE, 10Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (10SLyngshede-WMF) Name has been picked ([[ https://en.wikipedia.org/wiki/Bitu_(god) | Bitu ]]) [09:39:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37075 and previous config saved to /var/cache/conftool/dbconfig/20221031-093900-marostegui.json [09:39:55] (03CR) 10Slyngshede: [C: 03+2] role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [09:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37076 and previous config saved to /var/cache/conftool/dbconfig/20221031-094045-ladsgroup.json [09:40:51] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:44:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37077 and previous config saved to 
/var/cache/conftool/dbconfig/20221031-094439-ladsgroup.json [09:44:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:44:46] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:44:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:45:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37078 and previous config saved to /var/cache/conftool/dbconfig/20221031-094501-ladsgroup.json [09:47:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P37079 and previous config saved to /var/cache/conftool/dbconfig/20221031-094748-ladsgroup.json [09:52:37] !log depooling wdqs1007 while it catches up on lag - T322010 [09:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:43] T322010: Depool wdqs1007 - https://phabricator.wikimedia.org/T322010 [09:54:03] (03PS1) 10Slyngshede: profile::idm incorrectly named variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/850998 (https://phabricator.wikimedia.org/T320428) [09:54:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37080 and previous config saved to /var/cache/conftool/dbconfig/20221031-095407-marostegui.json [09:55:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37081 and previous config saved to /var/cache/conftool/dbconfig/20221031-095551-ladsgroup.json [09:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37082 and previous config saved to /var/cache/conftool/dbconfig/20221031-095657-ladsgroup.json [09:57:03] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:58:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37083 and previous config saved to /var/cache/conftool/dbconfig/20221031-095855-ladsgroup.json [09:59:02] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [10:00:31] (03CR) 10Slyngshede: [C: 03+2] profile::idm incorrectly named variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/850998 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [10:02:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T318955)', diff saved to https://phabricator.wikimedia.org/P37084 and previous config saved to /var/cache/conftool/dbconfig/20221031-100255-ladsgroup.json [10:02:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance [10:03:02] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:03:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance [10:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T318955)', diff saved to https://phabricator.wikimedia.org/P37085 and previous config saved to /var/cache/conftool/dbconfig/20221031-100316-ladsgroup.json [10:04:28] (03PS1) 10Filippo Giunchedi: dispatch: enforce ssl for dispatch DB user [puppet] - 10https://gerrit.wikimedia.org/r/850999 (https://phabricator.wikimedia.org/T313229) [10:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:09:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37086 and previous config saved to /var/cache/conftool/dbconfig/20221031-100913-marostegui.json [10:09:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:09:21] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - 
https://phabricator.wikimedia.org/T321123 [10:09:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:09:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37087 and previous config saved to /var/cache/conftool/dbconfig/20221031-100935-marostegui.json [10:10:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37088 and previous config saved to /var/cache/conftool/dbconfig/20221031-101046-marostegui.json [10:11:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37089 and previous config saved to /var/cache/conftool/dbconfig/20221031-101059-ladsgroup.json [10:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P37090 and previous config saved to /var/cache/conftool/dbconfig/20221031-101203-ladsgroup.json [10:12:31] (03PS1) 10Filippo Giunchedi: dispatch: run the scheduler on active host only [puppet] - 10https://gerrit.wikimedia.org/r/851000 (https://phabricator.wikimedia.org/T313229) [10:14:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37091 and previous config saved to /var/cache/conftool/dbconfig/20221031-101402-ladsgroup.json [10:14:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T318955)', diff saved to https://phabricator.wikimedia.org/P37092 and previous config saved to /var/cache/conftool/dbconfig/20221031-101422-ladsgroup.json [10:14:28] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:16:39] 
(03PS1) 10Filippo Giunchedi: idp: fix vhost_settings for dispatch_port [puppet] - 10https://gerrit.wikimedia.org/r/851001 (https://phabricator.wikimedia.org/T313229) [10:19:09] (03PS2) 10Filippo Giunchedi: idp: fix vhost_settings for dispatch_port [puppet] - 10https://gerrit.wikimedia.org/r/851001 (https://phabricator.wikimedia.org/T313229) [10:23:46] (03PS1) 10Elukey: Set coredns 1.8.7 for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/851002 (https://phabricator.wikimedia.org/T321159) [10:25:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37093 and previous config saved to /var/cache/conftool/dbconfig/20221031-102552-marostegui.json [10:26:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37094 and previous config saved to /var/cache/conftool/dbconfig/20221031-102606-ladsgroup.json [10:26:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [10:26:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:26:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37095 and previous config saved to /var/cache/conftool/dbconfig/20221031-102612-ladsgroup.json [10:26:14] (03CR) 10Elukey: [C: 03+1] "LGTM, really nice refactor :)" [puppet] - 10https://gerrit.wikimedia.org/r/850449 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:26:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [10:26:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P37096 and previous config saved to /var/cache/conftool/dbconfig/20221031-102627-ladsgroup.json [10:26:32] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37857/console" [puppet] - 10https://gerrit.wikimedia.org/r/851001 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:26:47] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] idp: fix vhost_settings for dispatch_port [puppet] - 10https://gerrit.wikimedia.org/r/851001 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:27:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P37097 and previous config saved to /var/cache/conftool/dbconfig/20221031-102710-ladsgroup.json [10:28:11] (03CR) 10Jbond: ci: move list of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [10:28:23] (03CR) 10Elukey: [C: 04-1] "This is not right, working on it." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/851002 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [10:29:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37098 and previous config saved to /var/cache/conftool/dbconfig/20221031-102908-ladsgroup.json [10:29:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P37099 and previous config saved to /var/cache/conftool/dbconfig/20221031-102928-ladsgroup.json [10:30:37] (03PS2) 10Elukey: Set coredns 1.8.7 for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/851002 (https://phabricator.wikimedia.org/T321159) [10:40:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37100 and previous config saved to /var/cache/conftool/dbconfig/20221031-104059-marostegui.json [10:41:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37101 and previous config saved to /var/cache/conftool/dbconfig/20221031-104119-ladsgroup.json [10:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37102 and previous config saved to /var/cache/conftool/dbconfig/20221031-104217-ladsgroup.json [10:42:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [10:42:23] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:42:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [10:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1127 (T318955)', diff saved to https://phabricator.wikimedia.org/P37103 and previous config saved to /var/cache/conftool/dbconfig/20221031-104238-ladsgroup.json [10:44:10] (03CR) 10Jbond: [C: 03+2] Add support for bookworm to apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/850484 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [10:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37104 and previous config saved to /var/cache/conftool/dbconfig/20221031-104415-ladsgroup.json [10:44:22] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [10:44:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P37105 and previous config saved to /var/cache/conftool/dbconfig/20221031-104435-ladsgroup.json [10:47:07] (03PS1) 10Filippo Giunchedi: hieradata: add dispatch.w.o to IDP [puppet] - 10https://gerrit.wikimedia.org/r/851003 (https://phabricator.wikimedia.org/T313229) [10:47:09] (03PS1) 10Filippo Giunchedi: hieradata: fix Prometheus IDP entry [puppet] - 10https://gerrit.wikimedia.org/r/851004 (https://phabricator.wikimedia.org/T313229) [10:51:24] I have two very simple patches, if anyone is up for review? [10:51:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/851003 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/851004 [10:51:45] jbond: ^ maybe if you are around ? 
[10:53:07] (03PS5) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [10:53:09] (03PS4) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [10:53:11] (03PS2) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [10:53:18] godog: looking [10:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:54:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318955)', diff saved to https://phabricator.wikimedia.org/P37106 and previous config saved to /var/cache/conftool/dbconfig/20221031-105418-ladsgroup.json [10:54:24] jbond: cheers, appreciate it [10:54:25] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:54:44] (03CR) 10CI reject: [V: 04-1] rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 (owner: 10Jbond) [10:54:59] (03CR) 10Jbond: [C: 04-1] hieradata: add dispatch.w.o to IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851003 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:55:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/851004 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:55:34] (03CR) 10Matthias Mullie: [C: 03+1] Update i18n for ca, nb, fi & hu [extensions/ImageSuggestions] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/850985 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [10:56:01] (03CR) 10Matthias Mullie: [C: 03+1] Enable ImageSuggestions on ca, no, fi & huwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [10:56:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37107 and previous config saved to /var/cache/conftool/dbconfig/20221031-105605-marostegui.json [10:56:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [10:56:13] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:56:21] godog: done, let me know if you need to talk about the dispatch service but from cas point they are both the same services as they have the same service_id/url [10:56:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [10:56:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37108 and previous config saved to /var/cache/conftool/dbconfig/20221031-105625-ladsgroup.json [10:56:28] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:58:04] jbond: doh! 
of course, copy/pasta typo on my end [10:58:10] will fix [10:58:13] ack [10:58:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321123)', diff saved to https://phabricator.wikimedia.org/P37109 and previous config saved to /var/cache/conftool/dbconfig/20221031-105818-marostegui.json [10:58:28] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:59:42] a small gerrit spam incoming, sorry about that [10:59:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T318955)', diff saved to https://phabricator.wikimedia.org/P37110 and previous config saved to /var/cache/conftool/dbconfig/20221031-105941-ladsgroup.json [10:59:43] (03PS2) 10Filippo Giunchedi: hieradata: add dispatch.w.o to IDP [puppet] - 10https://gerrit.wikimedia.org/r/851003 (https://phabricator.wikimedia.org/T313229) [10:59:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [10:59:45] (03PS2) 10Filippo Giunchedi: hieradata: fix Prometheus IDP entry [puppet] - 10https://gerrit.wikimedia.org/r/851004 (https://phabricator.wikimedia.org/T313229) [10:59:47] (03PS3) 10Filippo Giunchedi: hieradata: don't monitor /run/docker on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/850993 (https://phabricator.wikimedia.org/T313229) [10:59:48] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:59:49] (03PS2) 
10Filippo Giunchedi: dispatch: enforce ssl for dispatch DB user [puppet] - 10https://gerrit.wikimedia.org/r/850999 (https://phabricator.wikimedia.org/T313229) [10:59:51] (03PS2) 10Filippo Giunchedi: dispatch: run the scheduler on active host only [puppet] - 10https://gerrit.wikimedia.org/r/851000 (https://phabricator.wikimedia.org/T313229) [10:59:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [11:00:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T318955)', diff saved to https://phabricator.wikimedia.org/P37111 and previous config saved to /var/cache/conftool/dbconfig/20221031-110003-ladsgroup.json [11:00:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/851003 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [11:02:40] cheers jbond [11:02:49] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix Prometheus IDP entry [puppet] - 10https://gerrit.wikimedia.org/r/851004 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [11:02:52] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add dispatch.w.o to IDP [puppet] - 10https://gerrit.wikimedia.org/r/851003 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [11:03:28] matthiasmullie: o/ I noticed some mw errors in logstash related to mwmaint1002 - https://logstash.wikimedia.org/goto/1fa0b9eff2116a80b9535899ab643468 [11:03:34] np [11:04:41] (03PS6) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [11:04:43] (03PS5) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [11:04:45] (03PS3) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [11:04:47] (03PS1) 10Jbond: P:openstack: use yesy vs true for read_only parameter [puppet] 
- 10https://gerrit.wikimedia.org/r/851046 [11:09:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P37112 and previous config saved to /var/cache/conftool/dbconfig/20221031-110925-ladsgroup.json [11:10:06] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T318955)', diff saved to https://phabricator.wikimedia.org/P37113 and previous config saved to /var/cache/conftool/dbconfig/20221031-111058-ladsgroup.json [11:11:05] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37114 and previous config saved to /var/cache/conftool/dbconfig/20221031-111132-ladsgroup.json [11:11:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:11:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:11:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:11:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T318605)', diff saved to https://phabricator.wikimedia.org/P37115 and previous config saved to /var/cache/conftool/dbconfig/20221031-111153-ladsgroup.json [11:12:54] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:13:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2180', diff saved to https://phabricator.wikimedia.org/P37116 and previous config saved to /var/cache/conftool/dbconfig/20221031-111324-marostegui.json [11:16:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:16:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:16:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37117 and previous config saved to /var/cache/conftool/dbconfig/20221031-111641-ladsgroup.json [11:16:47] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:17:12] (03PS2) 10Jbond: P:openstack: use yesy vs true for read_only parameter [puppet] - 10https://gerrit.wikimedia.org/r/851046 [11:17:14] (03PS7) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [11:17:16] (03PS6) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [11:17:18] (03PS4) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [11:17:20] (03PS1) 10Jbond: wmflib::Host_or_network: add new type [puppet] - 10https://gerrit.wikimedia.org/r/851049 [11:17:22] (03PS1) 10Jbond: P:ci:data_sync: hosts_allow expects an array [puppet] - 10https://gerrit.wikimedia.org/r/851050 [11:18:25] (03PS7) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [11:18:51] (03CR) 10Jbond: [C: 03+2] wmflib::Host_or_network: add new type [puppet] - 10https://gerrit.wikimedia.org/r/851049 (owner: 10Jbond) [11:18:56] (03CR) 10Jbond: [C: 03+2] P:ci:data_sync: hosts_allow expects an array [puppet] - 10https://gerrit.wikimedia.org/r/851050 (owner: 10Jbond) [11:19:09] 
(03CR) 10Jbond: [C: 03+2] P:openstack: use yesy vs true for read_only parameter [puppet] - 10https://gerrit.wikimedia.org/r/851046 (owner: 10Jbond) [11:21:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318605)', diff saved to https://phabricator.wikimedia.org/P37118 and previous config saved to /var/cache/conftool/dbconfig/20221031-112125-ladsgroup.json [11:21:31] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:24:28] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:24:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P37119 and previous config saved to /var/cache/conftool/dbconfig/20221031-112431-ladsgroup.json [11:25:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318605)', diff saved to https://phabricator.wikimedia.org/P37120 and previous config saved to /var/cache/conftool/dbconfig/20221031-112523-ladsgroup.json [11:25:44] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [11:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P37121 and previous config saved to /var/cache/conftool/dbconfig/20221031-112605-ladsgroup.json [11:28:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P37122 and previous config saved to /var/cache/conftool/dbconfig/20221031-112831-marostegui.json [11:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on 
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:36:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37123 and previous config saved to /var/cache/conftool/dbconfig/20221031-113631-ladsgroup.json [11:36:34] (03PS8) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [11:36:36] (03PS8) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [11:36:38] (03PS5) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [11:36:40] (03PS1) 10Jbond: P:gerrit::migration: use array for allow hosts [puppet] - 10https://gerrit.wikimedia.org/r/851051 [11:38:07] (03PS2) 10Jbond: P:gerrit::migration: use array for allow hosts [puppet] - 10https://gerrit.wikimedia.org/r/851051 [11:38:39] (03CR) 10CI reject: [V: 04-1] rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 (owner: 10Jbond) [11:39:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T318955)', diff saved to https://phabricator.wikimedia.org/P37124 and previous config saved to /var/cache/conftool/dbconfig/20221031-113938-ladsgroup.json [11:39:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:39:45] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:39:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:40:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T318955)', diff saved to https://phabricator.wikimedia.org/P37125 and 
previous config saved to /var/cache/conftool/dbconfig/20221031-113959-ladsgroup.json [11:40:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37126 and previous config saved to /var/cache/conftool/dbconfig/20221031-114030-ladsgroup.json [11:41:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P37127 and previous config saved to /var/cache/conftool/dbconfig/20221031-114111-ladsgroup.json [11:43:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321123)', diff saved to https://phabricator.wikimedia.org/P37128 and previous config saved to /var/cache/conftool/dbconfig/20221031-114337-marostegui.json [11:43:44] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:44:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [11:44:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [11:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37129 and previous config saved to /var/cache/conftool/dbconfig/20221031-114443-marostegui.json [11:46:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37130 and previous config saved to /var/cache/conftool/dbconfig/20221031-114652-marostegui.json [11:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37131 and previous config saved to /var/cache/conftool/dbconfig/20221031-115138-ladsgroup.json [11:52:36] PROBLEM - BGP 
status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:53:22] elukey: thanks for the ping - it's harmless, was just trying to write stuff in a place the script couldn't (and it was quiet about it; hence the multiple entries :p) [11:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37132 and previous config saved to /var/cache/conftool/dbconfig/20221031-115536-ladsgroup.json [11:56:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T318955)', diff saved to https://phabricator.wikimedia.org/P37133 and previous config saved to /var/cache/conftool/dbconfig/20221031-115618-ladsgroup.json [11:56:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance [11:56:25] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:56:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance [11:56:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T318955)', diff saved to https://phabricator.wikimedia.org/P37134 and previous config saved to /var/cache/conftool/dbconfig/20221031-115639-ladsgroup.json [11:59:49] (03PS1) 10KartikMistry: Update cxserver to 2022-10-31-083825-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/851053 (https://phabricator.wikimedia.org/T225494) [12:01:14] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to 
https://phabricator.wikimedia.org/P37135 and previous config saved to /var/cache/conftool/dbconfig/20221031-120158-marostegui.json [12:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:06:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318605)', diff saved to https://phabricator.wikimedia.org/P37136 and previous config saved to /var/cache/conftool/dbconfig/20221031-120644-ladsgroup.json [12:06:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:06:51] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:07:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:08:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T318955)', diff saved to https://phabricator.wikimedia.org/P37137 and previous config saved to /var/cache/conftool/dbconfig/20221031-120807-ladsgroup.json [12:08:13] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318605)', diff saved to https://phabricator.wikimedia.org/P37138 and previous config saved to /var/cache/conftool/dbconfig/20221031-121043-ladsgroup.json [12:10:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [12:10:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [12:11:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:11:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:11:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37139 and previous config saved to /var/cache/conftool/dbconfig/20221031-121108-ladsgroup.json [12:12:43] (03PS1) 10Jbond: nodegen: add a new selector for puppet resources [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 [12:14:15] (03CR) 10CI reject: [V: 04-1] nodegen: add a new selector for puppet resources [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [12:14:48] (03CR) 10Jbond: [C: 03+2] Stop installing the base packages list for now [puppet] - 10https://gerrit.wikimedia.org/r/850508 (owner: 10Muehlenhoff) [12:16:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37140 and previous config saved to /var/cache/conftool/dbconfig/20221031-121658-ladsgroup.json [12:17:05] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:17:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P37141 and previous config saved to /var/cache/conftool/dbconfig/20221031-121705-marostegui.json [12:18:13] !log repooling wdqs1007 - catched up on lag - T322010 [12:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:19] T322010: Depool wdqs1007 - https://phabricator.wikimedia.org/T322010 [12:18:42] (03CR) 10Jbond: [C: 03+2] 
P:gerrit::migration: use array for allow hosts [puppet] - 10https://gerrit.wikimedia.org/r/851051 (owner: 10Jbond) [12:19:02] (03PS9) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [12:19:07] (03PS2) 10Stang: Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850577 [12:19:11] (03PS9) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [12:19:21] (03PS6) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [12:21:08] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01684 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:21:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37142 and previous config saved to /var/cache/conftool/dbconfig/20221031-122109-ladsgroup.json [12:21:16] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:23:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P37143 and previous config saved to /var/cache/conftool/dbconfig/20221031-122314-ladsgroup.json [12:25:20] * kart_ updating cxserver (attempt 2) [12:25:56] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-10-31-083825-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/851053 (https://phabricator.wikimedia.org/T225494) (owner: 10KartikMistry) [12:29:33] (03PS1) 10Btullis: Add a namespace for the stream-enrichment-poc on dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) [12:29:35] (03PS1) 10Slyngshede: C:idm::deployment of 
IDM. [puppet] - 10https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) [12:29:44] (03Merged) 10jenkins-bot: Update cxserver to 2022-10-31-083825-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/851053 (https://phabricator.wikimedia.org/T225494) (owner: 10KartikMistry) [12:32:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37144 and previous config saved to /var/cache/conftool/dbconfig/20221031-123204-ladsgroup.json [12:32:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37145 and previous config saved to /var/cache/conftool/dbconfig/20221031-123211-marostegui.json [12:32:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:32:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:32:18] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:32:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37146 and previous config saved to /var/cache/conftool/dbconfig/20221031-123222-marostegui.json [12:33:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T321123)', diff saved to https://phabricator.wikimedia.org/P37147 and previous config saved to /var/cache/conftool/dbconfig/20221031-123330-marostegui.json [12:33:33] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/851065 (owner: 10L10n-bot) [12:36:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P37148 and previous config saved to /var/cache/conftool/dbconfig/20221031-123616-ladsgroup.json [12:36:19] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:36:48] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:38:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P37149 and previous config saved to /var/cache/conftool/dbconfig/20221031-123822-ladsgroup.json [12:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318955)', diff saved to https://phabricator.wikimedia.org/P37150 and previous config saved to /var/cache/conftool/dbconfig/20221031-124015-ladsgroup.json [12:40:22] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:41:14] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:42:13] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:46:35] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:47:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37151 and previous config saved to /var/cache/conftool/dbconfig/20221031-124711-ladsgroup.json [12:47:14] (03PS2) 10Slyngshede: C:idm::deployment of IDM. 
[puppet] - 10https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) [12:47:35] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:48:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P37152 and previous config saved to /var/cache/conftool/dbconfig/20221031-124836-marostegui.json [12:48:41] Updated cxserver to 2022-10-31-083825-production (T225494, T314836, T295545, T319176) [12:48:42] T314836: Images get a link= to raw thumbnail url - https://phabricator.wikimedia.org/T314836 [12:48:42] T295545: Explain the template mapping status - https://phabricator.wikimedia.org/T295545 [12:48:42] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176 [12:48:43] T225494: Translate the initial title automatically - https://phabricator.wikimedia.org/T225494 [12:50:48] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002972 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:51:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P37153 and previous config saved to /var/cache/conftool/dbconfig/20221031-125123-ladsgroup.json [12:53:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T318955)', diff saved to https://phabricator.wikimedia.org/P37154 and previous config saved to /var/cache/conftool/dbconfig/20221031-125329-ladsgroup.json [12:53:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [12:53:35] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:53:44] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [12:53:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T318955)', diff saved to https://phabricator.wikimedia.org/P37155 and previous config saved to /var/cache/conftool/dbconfig/20221031-125350-ladsgroup.json [12:54:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:55:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:55:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37156 and previous config saved to /var/cache/conftool/dbconfig/20221031-125509-ladsgroup.json [12:55:15] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P37157 and previous config saved to /var/cache/conftool/dbconfig/20221031-125521-ladsgroup.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221031T1300). [13:00:04] matthiasmullie and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] o/ [13:00:13] o/ [13:00:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P37158 and previous config saved to /var/cache/conftool/dbconfig/20221031-130016-marostegui.json [13:00:57] I can deploy in a few [13:01:23] * urbanecm doesn't like DST shifting meetings/windows [13:01:47] (03CR) 10Urbanecm: [C: 03+2] Update i18n for ca, nb, fi & hu [extensions/ImageSuggestions] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/850985 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:02:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37159 and previous config saved to /var/cache/conftool/dbconfig/20221031-130217-ladsgroup.json [13:02:24] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:02:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P37160 and previous config saved to /var/cache/conftool/dbconfig/20221031-130244-marostegui.json [13:03:36] (03PS1) 10Daimona Eaytoy: Remove $wgCampaignEventsDatabaseName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851078 (https://phabricator.wikimedia.org/T318592) [13:04:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T318955)', diff saved to https://phabricator.wikimedia.org/P37161 and previous config saved to /var/cache/conftool/dbconfig/20221031-130454-ladsgroup.json [13:05:01] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:06:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37162 and previous config saved to /var/cache/conftool/dbconfig/20221031-130629-ladsgroup.json 
[13:06:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [13:06:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:06:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [13:06:47] (03PS1) 10Daimona Eaytoy: Enable the CampaignEvents extension on test(2)wiki and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851079 (https://phabricator.wikimedia.org/T318592) [13:06:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37163 and previous config saved to /var/cache/conftool/dbconfig/20221031-130651-ladsgroup.json [13:06:56] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1363 is CRITICAL: etcd last index (1208983) is outdated compared to the master one (1208986) https://wikitech.wikimedia.org/wiki/Etcd [13:06:58] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1451 is CRITICAL: etcd last index (1208983) is outdated compared to the master one (1208986) https://wikitech.wikimedia.org/wiki/Etcd [13:06:58] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2402 is CRITICAL: etcd last index (1647783) is outdated compared to the master one (1647789) https://wikitech.wikimedia.org/wiki/Etcd [13:06:58] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2328 is CRITICAL: etcd last index (1647783) is outdated compared to the master one (1647789) https://wikitech.wikimedia.org/wiki/Etcd [13:06:58] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1360 is CRITICAL: etcd last index (1208983) is outdated compared to the master one (1208986) https://wikitech.wikimedia.org/wiki/Etcd [13:07:00] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1395 is CRITICAL: etcd last index (1208983) is outdated compared to the master one (1208986) 
https://wikitech.wikimedia.org/wiki/Etcd [13:07:00] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2316 is CRITICAL: etcd last index (1647783) is outdated compared to the master one (1647789) https://wikitech.wikimedia.org/wiki/Etcd [13:07:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Jclark-ctr) @fnegri Thursday would work best for me [13:08:16] (03Merged) 10jenkins-bot: Update i18n for ca, nb, fi & hu [extensions/ImageSuggestions] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/850985 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:08:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P37164 and previous config saved to /var/cache/conftool/dbconfig/20221031-130834-marostegui.json [13:09:44] sorry for the delay [13:09:45] let's start [13:09:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/ImageSuggestions] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/850985 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:10:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:850985|Update i18n for ca, nb, fi & hu (T300064)]] [13:10:15] T300064: [S] Schedule image suggestions notifications in more wikis - https://phabricator.wikimedia.org/T300064 [13:10:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P37165 and previous config saved to /var/cache/conftool/dbconfig/20221031-131028-ladsgroup.json [13:10:41] !log urbanecm@deploy1002 urbanecm and mlitn: Backport for [[gerrit:850985|Update i18n for ca, nb, fi & hu (T300064)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:10:49] 
matthiasmullie: can you check at mwdebug1001 please? [13:10:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1363 is OK: etcd last index (1208992) matches the master one (1208992) https://wikitech.wikimedia.org/wiki/Etcd [13:10:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1451 is OK: etcd last index (1208992) matches the master one (1208992) https://wikitech.wikimedia.org/wiki/Etcd [13:10:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2402 is OK: etcd last index (1647801) matches the master one (1647801) https://wikitech.wikimedia.org/wiki/Etcd [13:10:54] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2328 is OK: etcd last index (1647801) matches the master one (1647801) https://wikitech.wikimedia.org/wiki/Etcd [13:10:55] (03CR) 10Urbanecm: [C: 03+2] Enable ImageSuggestions on ca, no, fi & huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:10:56] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1360 is OK: etcd last index (1208992) matches the master one (1208992) https://wikitech.wikimedia.org/wiki/Etcd [13:10:56] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1395 is OK: etcd last index (1208992) matches the master one (1208992) https://wikitech.wikimedia.org/wiki/Etcd [13:10:58] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2316 is OK: etcd last index (1647801) matches the master one (1647801) https://wikitech.wikimedia.org/wiki/Etcd [13:11:18] checking [13:11:48] (03Merged) 10jenkins-bot: Enable ImageSuggestions on ca, no, fi & huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:11:56] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:12:31] urbanecm: LGTM! 
[13:12:34] syncing! [13:14:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:14:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:15:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:16:53] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:850985|Update i18n for ca, nb, fi & hu (T300064)]] (duration: 06m 43s) [13:17:05] just in time, bots, just in time [13:17:08] anyway, matthiasmullie: done :) [13:17:12] Thanks! [13:17:14] doing second one [13:17:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:17:27] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:850425|Enable ImageSuggestions on ca, no, fi & huwiki (T300064)]] [13:17:47] !log urbanecm@deploy1002 urbanecm and mlitn: Backport for [[gerrit:850425|Enable ImageSuggestions on ca, no, fi & huwiki (T300064)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:17:54] matthiasmullie: can you check at mwdebug1001, please? 
[13:19:13] urbanecm: LGTM [13:19:17] great, syncing [13:20:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P37166 and previous config saved to /var/cache/conftool/dbconfig/20221031-132000-ladsgroup.json [13:20:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:21:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:21:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:21:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37167 and previous config saved to /var/cache/conftool/dbconfig/20221031-132145-ladsgroup.json [13:22:57] (03CR) 10Urbanecm: [C: 03+2] Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850577 (owner: 10Stang) [13:23:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:23:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:23:09] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:850425|Enable ImageSuggestions on ca, no, fi & huwiki (T300064)]] (duration: 05m 42s) [13:23:57] matthiasmullie: both done now [13:24:02] urbanecm: thanks! 
[13:24:05] (03PS3) 10Urbanecm: Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850577 (owner: 10Stang) [13:24:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:24:11] no problem [13:24:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850577 (owner: 10Stang) [13:24:58] (03Merged) 10jenkins-bot: Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850577 (owner: 10Stang) [13:25:01] (03CR) 10Matthias Mullie: [C: 03+1] "Has been confirmed with communities; this is good to go - ideally before Wed 2 Oct (so that's either today or tomorrow)" [puppet] - 10https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:25:14] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:850577|Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election"]] [13:25:33] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:850577|Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:25:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318955)', diff saved to https://phabricator.wikimedia.org/P37168 and previous config saved to /var/cache/conftool/dbconfig/20221031-132534-ladsgroup.json [13:25:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:25:38] hi, can i trouble someone to restart stashbot? https://wikitech.wikimedia.org/wiki/Tool:Stashbot#Maintenance are docs [13:25:43] koi: your patch is at mwdebug1001, can you check? 
[13:25:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:25:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:26:01] urbanecm: default language changed back to English, so LGTM [13:26:06] syncing [13:26:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:26:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37169 and previous config saved to /var/cache/conftool/dbconfig/20221031-132613-ladsgroup.json [13:28:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37170 and previous config saved to /var/cache/conftool/dbconfig/20221031-132823-ladsgroup.json [13:28:59] (03CR) 10Ssingh: [C: 03+2] Revert "Depool ulsfo for cp hosts hardware refresh" [dns] - 10https://gerrit.wikimedia.org/r/850436 (owner: 10Ssingh) [13:29:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:29:16] (03PS2) 10Ssingh: Revert "Depool ulsfo for cp hosts hardware refresh" [dns] - 10https://gerrit.wikimedia.org/r/850436 [13:29:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:850577|Revert "votewiki: Change wgLanguageCode to zh for Sep 2022 admins election"]] (duration: 04m 45s) [13:30:04] koi: and should be live [13:30:07] anything else, anyone? 
[13:30:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:30:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:31:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:31:34] !log running authdns-update for pooling ulsfo: T850436 [13:31:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37171 and previous config saved to /var/cache/conftool/dbconfig/20221031-133140-ladsgroup.json [13:32:04] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321902 (10TAndic) Adding here as well as T321903 was merged: approving access as @HShaath-WMF's direct manager. [13:34:14] (03PS4) 10Jbond: (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [13:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P37172 and previous config saved to /var/cache/conftool/dbconfig/20221031-133507-ladsgroup.json [13:35:41] (03CR) 10CI reject: [V: 04-1] (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [13:38:34] (03PS5) 10Jbond: (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [13:40:44] (03CR) 10CI reject: [V: 04-1] (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [13:43:30] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P37173 and previous config saved to /var/cache/conftool/dbconfig/20221031-134329-ladsgroup.json [13:43:48] (03CR) 10Subramanya Sastry: "Adding Sergio and removing myself." [puppet] - 10https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: 10Awight) [13:45:36] RECOVERY - PyBal backends health check on lvs4006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:45:54] godog: ^ thanks! [13:46:03] should now be clearing up [13:46:24] RECOVERY - PyBal backends health check on lvs4007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:46:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37174 and previous config saved to /var/cache/conftool/dbconfig/20221031-134647-ladsgroup.json [13:48:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:49:07] dcausse: ^ [13:49:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37175 and previous config saved to /var/cache/conftool/dbconfig/20221031-134911-ladsgroup.json [13:49:18] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:50:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T318955)', diff saved to https://phabricator.wikimedia.org/P37176 and previous config saved to 
/var/cache/conftool/dbconfig/20221031-135013-ladsgroup.json [13:50:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [13:50:20] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:50:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [13:50:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:50:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:50:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T318955)', diff saved to https://phabricator.wikimedia.org/P37177 and previous config saved to /var/cache/conftool/dbconfig/20221031-135039-ladsgroup.json [13:53:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:54:46] gehel: looking [13:55:14] dcausse You mind if we jump on a Meet about this? I'd like to learn more about how to troubleshoot this [13:55:29] inflatador: sure [13:55:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) @Jclark-ctr Thursday works for me! I will aim to get the server depooled by 11:00 UTC on Thursday, and will post an update in this task w... 
[13:58:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P37178 and previous config saved to /var/cache/conftool/dbconfig/20221031-135836-ladsgroup.json [13:59:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37179 and previous config saved to /var/cache/conftool/dbconfig/20221031-135929-ladsgroup.json [13:59:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:01:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37180 and previous config saved to /var/cache/conftool/dbconfig/20221031-140153-ladsgroup.json [14:03:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T318955)', diff saved to https://phabricator.wikimedia.org/P37181 and previous config saved to /var/cache/conftool/dbconfig/20221031-140259-ladsgroup.json [14:03:06] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:04:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37182 and previous config saved to /var/cache/conftool/dbconfig/20221031-140417-ladsgroup.json [14:06:10] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:08:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:22] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37183 and previous config saved to /var/cache/conftool/dbconfig/20221031-141342-ladsgroup.json [14:13:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:13:49] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:13:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:14:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37184 and previous config saved to /var/cache/conftool/dbconfig/20221031-141404-ladsgroup.json [14:14:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37185 and previous config saved to /var/cache/conftool/dbconfig/20221031-141436-ladsgroup.json [14:17:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P37186 and previous config 
saved to /var/cache/conftool/dbconfig/20221031-141701-ladsgroup.json [14:17:07] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:18:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P37187 and previous config saved to /var/cache/conftool/dbconfig/20221031-141806-ladsgroup.json [14:18:52] (03PS6) 10Jbond: (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [14:19:10] (03CR) 10Elukey: Add a namespace for the stream-enrichment-poc on dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) (owner: 10Btullis) [14:19:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37188 and previous config saved to /var/cache/conftool/dbconfig/20221031-141924-ladsgroup.json [14:20:15] (03PS7) 10Jbond: (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [14:22:21] (03CR) 10CI reject: [V: 04-1] (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [14:22:48] (03PS10) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [14:24:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37189 and previous config saved to /var/cache/conftool/dbconfig/20221031-142400-ladsgroup.json [14:24:08] T318955: Drop fr_comment and fr_text from flaggedrevs table in 
production - https://phabricator.wikimedia.org/T318955 [14:25:16] (03PS7) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [14:26:16] (03PS8) 10Jbond: (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [14:26:52] (03CR) 10Elukey: [C: 03+2] Set coredns 1.8.7 for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/851002 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [14:27:34] (03CR) 10CI reject: [V: 04-1] (WIP) nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [14:29:23] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:29:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:29:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37190 and previous config saved to /var/cache/conftool/dbconfig/20221031-142942-ladsgroup.json [14:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P37191 and previous config saved to /var/cache/conftool/dbconfig/20221031-143312-ladsgroup.json [14:34:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37192 and previous config saved to /var/cache/conftool/dbconfig/20221031-143430-ladsgroup.json [14:34:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:34:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:34:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:34:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:34:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:34:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T318605)', diff saved to https://phabricator.wikimedia.org/P37193 and previous config saved to /var/cache/conftool/dbconfig/20221031-143458-ladsgroup.json [14:35:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P37194 and previous config saved to /var/cache/conftool/dbconfig/20221031-143507-ladsgroup.json 
[14:39:04] (03PS3) 10Filippo Giunchedi: dispatch: run the scheduler on active host only [puppet] - 10https://gerrit.wikimedia.org/r/851000 (https://phabricator.wikimedia.org/T313229) [14:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P37195 and previous config saved to /var/cache/conftool/dbconfig/20221031-143906-ladsgroup.json [14:41:08] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: explicitly accept VRRP packets from the peer [puppet] - 10https://gerrit.wikimedia.org/r/851087 (https://phabricator.wikimedia.org/T320975) [14:44:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37196 and previous config saved to /var/cache/conftool/dbconfig/20221031-144449-ladsgroup.json [14:44:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:44:56] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:45:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:45:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T318605)', diff saved to https://phabricator.wikimedia.org/P37197 and previous config saved to /var/cache/conftool/dbconfig/20221031-144511-ladsgroup.json [14:48:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T318955)', diff saved to https://phabricator.wikimedia.org/P37198 and previous config saved to /var/cache/conftool/dbconfig/20221031-144819-ladsgroup.json [14:48:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [14:48:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - 
https://phabricator.wikimedia.org/T318955 [14:48:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [14:48:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37199 and previous config saved to /var/cache/conftool/dbconfig/20221031-144840-ladsgroup.json [14:50:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: Maint done', diff saved to https://phabricator.wikimedia.org/P37200 and previous config saved to /var/cache/conftool/dbconfig/20221031-145012-ladsgroup.json [14:51:26] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10Papaul) @Rkemper can you please double check this alert , looking at the disk here looks good to me no sign of failure led is green and showing activity. [14:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:54:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P37201 and previous config saved to /var/cache/conftool/dbconfig/20221031-145413-ladsgroup.json [14:57:48] (03PS8) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) [14:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:01:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T318955)', diff saved 
to https://phabricator.wikimedia.org/P37202 and previous config saved to /var/cache/conftool/dbconfig/20221031-150105-ladsgroup.json [15:01:11] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:02:38] (03PS1) 10Jbond: worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 [15:04:26] (03CR) 10CI reject: [V: 04-1] worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 (owner: 10Jbond) [15:04:46] (03CR) 10Btullis: Add a namespace for the stream-enrichment-poc on dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/851063 (https://phabricator.wikimedia.org/T321682) (owner: 10Btullis) [15:05:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P37203 and previous config saved to /var/cache/conftool/dbconfig/20221031-150517-ladsgroup.json [15:09:09] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/37859/" [puppet] - 10https://gerrit.wikimedia.org/r/851087 (https://phabricator.wikimedia.org/T320975) (owner: 10Arturo Borrero Gonzalez) [15:09:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37204 and previous config saved to /var/cache/conftool/dbconfig/20221031-150919-ladsgroup.json [15:09:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:09:26] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 190 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:09:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:09:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:09:58] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:10:02] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:06] !llog [Elastic] Depooled and shutdown `elastic2043` [15:11:02] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:11:26] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:11:44] (03CR) 10Andrew Bogott: [C: 03+1] cloudgw: explicitly accept VRRP packets from the peer [puppet] - 10https://gerrit.wikimedia.org/r/851087 (https://phabricator.wikimedia.org/T320975) (owner: 10Arturo Borrero Gonzalez) [15:11:54] (03CR) 10Ahmon Dancy: "The plan is to run rsync from inside a container running on a trusted runner, so hopefully this commit will not be needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/850597 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [15:11:57] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw: explicitly accept VRRP packets from the peer [puppet] - 10https://gerrit.wikimedia.org/r/851087 (https://phabricator.wikimedia.org/T320975) (owner: 10Arturo Borrero Gonzalez) [15:16:06] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [15:16:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P37205 and previous config saved to /var/cache/conftool/dbconfig/20221031-151612-ladsgroup.json [15:17:00] PROBLEM - Host elastic2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:18:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:18:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:18:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T318955)', diff saved to https://phabricator.wikimedia.org/P37206 and previous config saved to /var/cache/conftool/dbconfig/20221031-151851-ladsgroup.json [15:18:58] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:18:58] (03CR) 10Jbond: [C: 03+1] "lgtm will leave someone on wmcs to merge" [puppet] - 10https://gerrit.wikimedia.org/r/850633 (owner: 10Majavah) [15:19:18] (03CR) 10Jbond: [C: 03+2] C:debian: add support for testing [puppet] - 10https://gerrit.wikimedia.org/r/850499 (https://phabricator.wikimedia.org/T321906) (owner: 10Jbond) [15:20:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P37207 and previous config saved to 
/var/cache/conftool/dbconfig/20221031-152022-ladsgroup.json [15:21:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T318955)', diff saved to https://phabricator.wikimedia.org/P37208 and previous config saved to /var/cache/conftool/dbconfig/20221031-152100-ladsgroup.json [15:21:20] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:22:14] 10SRE, 10ops-codfw, 10Discovery-Search: elastic2043 reported memory errors - https://phabricator.wikimedia.org/T321771 (10Papaul) 05Open→03Resolved DIMM A2 replaced [15:23:02] RECOVERY - Host elastic2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [15:23:07] (03CR) 10Jbond: C:idm::deployment of IDM. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [15:24:12] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10Papaul) [15:27:51] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851092 (https://phabricator.wikimedia.org/T128546) [15:29:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318605)', diff saved to https://phabricator.wikimedia.org/P37209 and previous config saved to /var/cache/conftool/dbconfig/20221031-152906-ladsgroup.json [15:29:13] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:30:47] (03PS1) 10Elukey: ml-services: update docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/851093 (https://phabricator.wikimedia.org/T320374) [15:31:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P37210 and previous config saved to /var/cache/conftool/dbconfig/20221031-153121-ladsgroup.json [15:33:00] (03PS2) 10Jbond: aptrepo: Add 
component pyall [puppet] - 10https://gerrit.wikimedia.org/r/850093 [15:33:05] (03PS2) 10Jbond: nodegen: add a new selector for puppet resources [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 [15:33:07] (03PS9) 10Jbond: nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [15:33:09] (03PS2) 10Jbond: worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 [15:33:11] (03PS1) 10Jbond: pin setuptools [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851094 [15:33:13] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:34:03] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851092 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:34:56] (03CR) 10CI reject: [V: 04-1] pin setuptools [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851094 (owner: 10Jbond) [15:35:09] (03CR) 10CI reject: [V: 04-1] nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [15:35:11] (03CR) 10CI reject: [V: 04-1] nodegen: add a new selector for puppet resources [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [15:35:44] (03CR) 10CI reject: [V: 04-1] worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 (owner: 
10Jbond) [15:36:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P37211 and previous config saved to /var/cache/conftool/dbconfig/20221031-153607-ladsgroup.json [15:37:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318605)', diff saved to https://phabricator.wikimedia.org/P37212 and previous config saved to /var/cache/conftool/dbconfig/20221031-153730-ladsgroup.json [15:37:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:37:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:38:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:38:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:39:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:39:47] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:851092| Bumping portals to master (T128546)]] (duration: 03m 43s) [15:39:53] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:40:58] (03CR) 10BBlack: [C: 03+1] varnish: Fix identation [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [15:42:02] (03PS1) 10Ryan Kemper: Revert "admin: ryankemper update shell to zsh" [puppet] - 10https://gerrit.wikimedia.org/r/851011 [15:42:21] (03PS2) 10Ryan Kemper: Revert "admin: ryankemper update shell to zsh" [puppet] - 10https://gerrit.wikimedia.org/r/851011 [15:43:02] (03CR) 10Andrew Bogott: [C: 04-1] "I think we should rip out the drbd monitoring since it's known to be broken currently and we don't plan to use it in the future." 
[alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [15:43:22] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:851092| Bumping portals to master (T128546)]] (duration: 03m 34s) [15:43:43] (03CR) 10Andrew Bogott: [C: 03+2] toolsdb: disable replication for s54518__mw [puppet] - 10https://gerrit.wikimedia.org/r/810420 (owner: 10Majavah) [15:43:49] (03PS2) 10Andrew Bogott: toolsdb: disable replication for s54518__mw [puppet] - 10https://gerrit.wikimedia.org/r/810420 (owner: 10Majavah) [15:44:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37213 and previous config saved to /var/cache/conftool/dbconfig/20221031-154413-ladsgroup.json [15:46:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37214 and previous config saved to /var/cache/conftool/dbconfig/20221031-154627-ladsgroup.json [15:46:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:46:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:46:34] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:46:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37215 and previous config saved to /var/cache/conftool/dbconfig/20221031-154638-ladsgroup.json [15:48:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "this may break someone workflow. Waiting to collect other +1 before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/850633 (owner: 10Majavah) [15:50:48] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10BBlack) Update - `ulsfo` is repooled this morning, with all new hardware on the new configuration, and has the "s... [15:50:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.drmrs.wmnet,service=ats-tls [15:50:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.drmrs.wmnet,service=ats-be [15:50:51] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.drmrs.wmnet,service=varnish-fe [15:51:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P37216 and previous config saved to /var/cache/conftool/dbconfig/20221031-155113-ladsgroup.json [15:52:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37217 and previous config saved to /var/cache/conftool/dbconfig/20221031-155236-ladsgroup.json [15:53:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.drmrs.wmnet,service=ats-tls [15:53:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.drmrs.wmnet,service=ats-be [15:53:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.drmrs.wmnet,service=varnish-fe [15:54:00] !log [Elastic] `ryankemper@elastic2043:~$ sudo pool` (cluster back to green and DIMM A2 has been switched out by dc-ops); marked as `Active` in netbox [15:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:30] (03PS10) 10Jbond: nodegen: add node selections based on commited files [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [15:56:05] (03CR) 10jenkins-bot: nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [15:56:11] !log [Elastic] `ryankemper@elastic2052:~$ sudo reboot` to grab latest kernel [15:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:34] (03PS11) 10Jbond: nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) [15:56:56] PROBLEM - Host elastic2052 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:04] (03CR) 10CI reject: [V: 04-1] nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [15:58:22] RECOVERY - Host elastic2052 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [15:58:37] (03PS3) 10Jbond: worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 [15:58:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37218 and previous config saved to /var/cache/conftool/dbconfig/20221031-155850-ladsgroup.json [15:58:57] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37219 and previous config saved to /var/cache/conftool/dbconfig/20221031-155919-ladsgroup.json [16:00:28] (03CR) 10CI reject: [V: 04-1] worker: pass error codes for base and change explcitly [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/851090 (owner: 10Jbond) [16:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:02:24] (03PS1) 10Majavah: P:prometheus::beta: swap prometheus-labs-targets with a puppetdb query [puppet] - 10https://gerrit.wikimedia.org/r/851101 [16:02:26] (03PS1) 10Majavah: prometheus::wmcs_scripts: deleted unused class [puppet] - 10https://gerrit.wikimedia.org/r/851102 [16:06:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T318955)', diff saved to https://phabricator.wikimedia.org/P37220 and previous config saved to /var/cache/conftool/dbconfig/20221031-160620-ladsgroup.json [16:06:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:06:27] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:06:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T318955)', diff saved to https://phabricator.wikimedia.org/P37221 and previous config saved to /var/cache/conftool/dbconfig/20221031-160641-ladsgroup.json [16:07:07] (03CR) 10Ryan Kemper: [C: 03+2] Revert "admin: ryankemper update shell to zsh" [puppet] - 10https://gerrit.wikimedia.org/r/851011 (owner: 10Ryan Kemper) [16:07:26] (03PS4) 10Jbond: worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 [16:07:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37222 and previous config saved to /var/cache/conftool/dbconfig/20221031-160743-ladsgroup.json [16:07:45] ACKNOWLEDGEMENT - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T322042 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:07:46] (03CR) 10Dzahn: [C: 03+1] varnish: Fix identation [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [16:07:50] 10SRE, 10ops-codfw: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T322042 (10ops-monitoring-bot) [16:09:44] (03Abandoned) 10Dzahn: gitlab::runner: install rsync package [puppet] - 10https://gerrit.wikimedia.org/r/850597 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [16:09:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T318955)', diff saved to https://phabricator.wikimedia.org/P37223 and previous config saved to /var/cache/conftool/dbconfig/20221031-160951-ladsgroup.json [16:12:09] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37862/gitlab-runner1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: 10Ahmon Dancy) [16:13:48] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [16:13:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P37224 and previous config saved to /var/cache/conftool/dbconfig/20221031-161356-ladsgroup.json [16:13:58] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:14:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318605)', diff saved to https://phabricator.wikimedia.org/P37225 and previous config saved to /var/cache/conftool/dbconfig/20221031-161426-ladsgroup.json [16:14:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:14:32] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:14:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:14:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37226 and previous config saved to /var/cache/conftool/dbconfig/20221031-161448-ladsgroup.json [16:15:00] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:15:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:16:04] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "if it's urgent I would be ok to deploy this, if it can wait for Jelto to be back, I'll let him review first" [puppet] - 10https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: 10Ahmon Dancy) [16:16:34] (03CR) 10Ahmon Dancy: Allow rsync to doc.discovery.wmnet from trusted runner containers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: 10Ahmon Dancy) [16:17:22] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:17:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; 
selector: name=cp4052.ulsfo.wmnet,service=ats-tls [16:17:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=ats-be [16:17:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=varnish-fe [16:18:02] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:19:20] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:19:22] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:20:02] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:21:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318605)', diff saved to https://phabricator.wikimedia.org/P37227 and previous config saved to /var/cache/conftool/dbconfig/20221031-162249-ladsgroup.json [16:22:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:22:56] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:23:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37228 and previous config saved to 
/var/cache/conftool/dbconfig/20221031-162311-ladsgroup.json [16:24:00] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:24:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37229 and previous config saved to /var/cache/conftool/dbconfig/20221031-162405-ladsgroup.json [16:24:33] !log hashar@deploy1002 Started deploy [integration/docroot@0ff8642]: build: Use disableProcessTimeout() for serve commands only [16:24:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P37230 and previous config saved to /var/cache/conftool/dbconfig/20221031-162458-ladsgroup.json [16:24:59] !log hashar@deploy1002 Finished deploy [integration/docroot@0ff8642]: build: Use disableProcessTimeout() for serve commands only (duration: 00m 25s) [16:25:26] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:25:30] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:28:12] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P37231 and previous config saved to /var/cache/conftool/dbconfig/20221031-162903-ladsgroup.json [16:38:14] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:39:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37232 and previous config saved to /var/cache/conftool/dbconfig/20221031-163912-ladsgroup.json [16:39:16] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:39:27] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [16:39:43] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [16:39:58] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851094 (owner: 10Jbond) [16:40:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P37233 and previous config saved to /var/cache/conftool/dbconfig/20221031-164004-ladsgroup.json [16:43:14] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:44:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T318955)', diff saved to https://phabricator.wikimedia.org/P37234 and previous config saved to /var/cache/conftool/dbconfig/20221031-164409-ladsgroup.json [16:44:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [16:44:18] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:44:25] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [16:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T318955)', diff saved to https://phabricator.wikimedia.org/P37235 and previous config saved to /var/cache/conftool/dbconfig/20221031-164431-ladsgroup.json [16:48:01] (03CR) 10BCornwall: [C: 04-1] "I just confirmed that 4x spaces seems to be what is now expected. Zabe, if you are so inclined could you please retab the tabs into 4 spac" [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [16:48:08] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:48:12] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:49:28] (03CR) 10BCornwall: [C: 04-1] varnish: Fix identation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [16:50:10] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:51:28] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:51:28] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:54:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37236 and previous config saved to /var/cache/conftool/dbconfig/20221031-165418-ladsgroup.json [16:55:06] PROBLEM - Host contint1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T318955)', diff saved to https://phabricator.wikimedia.org/P37237 and previous config saved to /var/cache/conftool/dbconfig/20221031-165511-ladsgroup.json [16:55:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:55:17] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:55:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T318955)', diff saved to https://phabricator.wikimedia.org/P37238 and previous config saved to /var/cache/conftool/dbconfig/20221031-165532-ladsgroup.json [16:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T318955)', diff saved to https://phabricator.wikimedia.org/P37239 and previous config saved to /var/cache/conftool/dbconfig/20221031-165532-ladsgroup.json [16:56:06] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:57:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:57:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T318955)', diff saved to https://phabricator.wikimedia.org/P37240 and previous config saved to /var/cache/conftool/dbconfig/20221031-165742-ladsgroup.json [16:58:04] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:59:22] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:59:27] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851094 (owner: 10Jbond) [17:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221031T1700). [17:01:43] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [17:01:52] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [17:02:53] !log contint1001 - just went fully down without maintenance work, fortunately 2001 is the prod CI server currently [17:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:26] (03CR) 10BCornwall: [C: 03+1] acme-chief: Unlink certificate renewal and OCSP handling [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [17:07:02] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [17:07:09] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [17:09:25] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37241 and previous config saved to /var/cache/conftool/dbconfig/20221031-170925-ladsgroup.json [17:09:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:09:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:09:32] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:09:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37242 and previous config saved to /var/cache/conftool/dbconfig/20221031-170935-ladsgroup.json [17:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P37243 and previous config saved to /var/cache/conftool/dbconfig/20221031-171039-ladsgroup.json [17:12:02] (03PS2) 10Zabe: varnish: Fix identation [puppet] - 10https://gerrit.wikimedia.org/r/829319 [17:12:27] (03PS2) 10Daimona Eaytoy: Remove $wgCampaignEventsDatabaseName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851078 (https://phabricator.wikimedia.org/T318592) [17:12:35] (03PS1) 10Jbond: break sretest [puppet] - 10https://gerrit.wikimedia.org/r/851107 [17:12:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P37244 and previous config saved to /var/cache/conftool/dbconfig/20221031-171248-ladsgroup.json [17:13:17] (03CR) 10Zabe: varnish: Fix identation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [17:14:42] !log contint1001 - racadm serveraction powercyle - crashed [17:14:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:53] (03CR) 10CI reject: [V: 04-1] break sretest [puppet] - 10https://gerrit.wikimedia.org/r/851107 (owner: 10Jbond) [17:15:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37245 and previous config saved to /var/cache/conftool/dbconfig/20221031-171501-ladsgroup.json [17:15:10] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:16:08] RECOVERY - Host contint1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [17:20:19] (03PS1) 10Clare Ming: Update sample rate for edit attempt stream to 1 for group 0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) [17:25:25] (03CR) 10Jbond: [C: 03+2] pin setuptools [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851094 (owner: 10Jbond) [17:25:28] (03CR) 10Jbond: [C: 03+2] nodegen: add a new selector for puppet resources [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [17:25:32] (03CR) 10Jbond: [C: 03+2] nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [17:25:36] (03CR) 10Jbond: [C: 03+2] worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 (owner: 10Jbond) [17:25:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P37246 and previous config saved to /var/cache/conftool/dbconfig/20221031-172545-ladsgroup.json [17:27:36] (03Merged) 10jenkins-bot: pin setuptools [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851094 (owner: 10Jbond) [17:27:38] (03CR) 10CI reject: [V: 04-1] nodegen: add a new selector for puppet 
resources [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [17:27:40] (03CR) 10CI reject: [V: 04-1] nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [17:27:42] (03CR) 10CI reject: [V: 04-1] worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 (owner: 10Jbond) [17:27:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P37247 and previous config saved to /var/cache/conftool/dbconfig/20221031-172755-ladsgroup.json [17:28:23] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/851111 [17:29:08] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:30:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37248 and previous config saved to /var/cache/conftool/dbconfig/20221031-173008-ladsgroup.json [17:30:16] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/851111 (owner: 10Jbond) [17:33:31] (03PS8) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [17:36:45] (03CR) 10Phuedx: [C: 04-1] "Thanks for submitting this patch. See inline." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [17:39:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10thcipriani) >>! 
In T321772#8355722, @SLyngshede-WMF wrote: > @thcipriani do you function as a WMF sponsor/manager as well in this case? Usually I only count a... [17:40:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T318955)', diff saved to https://phabricator.wikimedia.org/P37249 and previous config saved to /var/cache/conftool/dbconfig/20221031-174052-ladsgroup.json [17:43:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T318955)', diff saved to https://phabricator.wikimedia.org/P37250 and previous config saved to /var/cache/conftool/dbconfig/20221031-174301-ladsgroup.json [17:43:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:43:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:43:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T318955)', diff saved to https://phabricator.wikimedia.org/P37251 and previous config saved to /var/cache/conftool/dbconfig/20221031-174323-ladsgroup.json [17:44:31] (03CR) 10Ssingh: [C: 03+1] "Looks good and PCC additionally confirms that config files didn't change." [puppet] - 10https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) (owner: 10Vgutierrez) [17:45:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37252 and previous config saved to /var/cache/conftool/dbconfig/20221031-174514-ladsgroup.json [17:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T318955)', diff saved to https://phabricator.wikimedia.org/P37253 and previous config saved to /var/cache/conftool/dbconfig/20221031-174532-ladsgroup.json [17:45:44] (03CR) 10Dzahn: "seems good to me. Just wonder why in compiler output it also replaces uid/id. 
it changes from numeric to names. 0 becomes root etc. Is tha" [puppet] - 10https://gerrit.wikimedia.org/r/850173 (owner: 10Jbond) [17:46:18] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 149 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:46:18] (03CR) 10Jbond: [V: 03+2 C: 03+2] nodegen: add a new selector for puppet resources [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851054 (owner: 10Jbond) [17:46:21] (03CR) 10Jbond: [V: 03+2 C: 03+2] nodegen: add node selections based on commited files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651925 (https://phabricator.wikimedia.org/T166066) (owner: 10Jbond) [17:46:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] worker: pass error codes for base and change explcitly [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851090 (owner: 10Jbond) [17:48:16] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:51:41] (03PS3) 10Dzahn: ci: move list of contint and zuul hosts to hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593 [17:52:00] (03CR) 10Dzahn: ci: move list of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [17:52:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 236 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:54:12] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:55:31] (03PS4) 10Dzahn: ci: move list of contint hosts to hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593 [17:56:26] (03PS5) 10Dzahn: ci: move lists of contint and zuul hosts to hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593 [18:00:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37254 and previous config saved to /var/cache/conftool/dbconfig/20221031-180021-ladsgroup.json [18:00:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [18:00:28] (03CR) 10Clare Ming: Update sample rate for edit attempt stream to 1 for group 0. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [18:00:32] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:00:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [18:00:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P37255 and previous config saved to /var/cache/conftool/dbconfig/20221031-180039-ladsgroup.json [18:00:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T318605)', diff saved to https://phabricator.wikimedia.org/P37256 and previous config saved to /var/cache/conftool/dbconfig/20221031-180049-ladsgroup.json [18:01:36] (03PS2) 10Clare Ming: Update sample rate for edit attempt stream to 1 for group 0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) [18:01:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37257 and previous config saved to /var/cache/conftool/dbconfig/20221031-180148-ladsgroup.json [18:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:11:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:15:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P37258 and previous config saved to 
/var/cache/conftool/dbconfig/20221031-181546-ladsgroup.json [18:16:52] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37259 and previous config saved to /var/cache/conftool/dbconfig/20221031-181654-ladsgroup.json [18:26:38] (03CR) 10Andrew Bogott: [C: 03+2] prometheus::wmcs_scripts: deleted unused class [puppet] - 10https://gerrit.wikimedia.org/r/851102 (owner: 10Majavah) [18:27:14] (03CR) 10Andrew Bogott: prometheus::wmcs_scripts: deleted unused class [puppet] - 10https://gerrit.wikimedia.org/r/851102 (owner: 10Majavah) [18:27:37] (03CR) 10Andrew Bogott: "This is specifically for deployment-prep?" [puppet] - 10https://gerrit.wikimedia.org/r/851101 (owner: 10Majavah) [18:30:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T318955)', diff saved to https://phabricator.wikimedia.org/P37260 and previous config saved to /var/cache/conftool/dbconfig/20221031-183052-ladsgroup.json [18:30:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:31:05] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:31:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:32:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37261 and previous config saved to /var/cache/conftool/dbconfig/20221031-183201-ladsgroup.json [18:33:24] (03CR) 10Ottomata: [C: 03+2] Declare mediawiki.page_change stream in beta [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [18:37:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:38:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:38:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:39:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:40:24] (03CR) 10Phuedx: [C: 03+1] Update sample rate for edit attempt stream to 1 for group 0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [18:44:00] (03CR) 10Samtar: [C: 04-1] "See https://phabricator.wikimedia.org/T310974#8357676, suggest pull from deployment calendar until (further) discussion is had, again." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [18:45:59] (03PS3) 10Clare Ming: Update sample rate for edit attempt stream to 1 for group 0. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) [18:47:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318605)', diff saved to https://phabricator.wikimedia.org/P37262 and previous config saved to /var/cache/conftool/dbconfig/20221031-184707-ladsgroup.json [18:47:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [18:47:16] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:47:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [18:47:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T318605)', diff saved to https://phabricator.wikimedia.org/P37263 and previous config saved to /var/cache/conftool/dbconfig/20221031-184729-ladsgroup.json [18:49:26] (03PS1) 10Andrew Bogott: Add openstack::clientpackages::vms::yoga classes [puppet] - 10https://gerrit.wikimedia.org/r/851115 (https://phabricator.wikimedia.org/T305828) [18:50:22] (03CR) 10Andrew Bogott: [C: 03+2] Add openstack::clientpackages::vms::yoga classes [puppet] - 10https://gerrit.wikimedia.org/r/851115 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [18:50:32] (03CR) 10Majavah: P:prometheus::beta: swap prometheus-labs-targets with a puppetdb query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851101 (owner: 10Majavah) [18:52:49] TheresNoTime: got it, sorry about that :( [18:53:38] you've got nothing to be sorry about, speaking openly it's getting on my nerves :) [18:53:46] (03PS1) 10Ottomata: Declare rc0.mediawiki.page_change stream in beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851117 (https://phabricator.wikimedia.org/T311129) [18:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 
is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:54:20] (03PS1) 10Jbond: directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 [18:56:48] (03PS2) 10Ottomata: Declare rc0.mediawiki.page_change stream in beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851117 (https://phabricator.wikimedia.org/T311129) [18:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:00:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318605)', diff saved to https://phabricator.wikimedia.org/P37264 and previous config saved to /var/cache/conftool/dbconfig/20221031-190054-ladsgroup.json [19:01:07] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:02:46] (03CR) 10Ottomata: [C: 03+2] Declare rc0.mediawiki.page_change stream in beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851117 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [19:03:28] (03Merged) 10jenkins-bot: Declare rc0.mediawiki.page_change stream in beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851117 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [19:10:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:11:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:11:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:12:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:16:02] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37265 and previous config saved to /var/cache/conftool/dbconfig/20221031-191601-ladsgroup.json [19:16:29] (03PS2) 10Jbond: directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 [19:23:43] (03PS2) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [19:25:35] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [19:26:31] (03PS3) 10Jbond: directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 [19:26:59] (03PS1) 10Andrew Bogott: Add openstack::clientpackages::vms::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851119 [19:27:52] (03CR) 10CI reject: [V: 04-1] directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [19:28:38] (03PS2) 10Andrew Bogott: Add openstack::clientpackages::vms::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851119 (https://phabricator.wikimedia.org/T305828) [19:29:52] (03CR) 10Andrew Bogott: [C: 03+2] Add openstack::clientpackages::vms::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851119 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [19:31:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37267 and previous config saved to /var/cache/conftool/dbconfig/20221031-193108-ladsgroup.json [19:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:36:47] (03PS1) 10Andrew 
Bogott: Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) [19:37:28] (03CR) 10CI reject: [V: 04-1] Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [19:41:48] (03PS1) 10BryanDavis: striker: Bump container version to 2022-10-27-235113-production [puppet] - 10https://gerrit.wikimedia.org/r/851121 (https://phabricator.wikimedia.org/T285403) [19:45:21] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools mobile visual enhancements at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318868) [19:46:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318605)', diff saved to https://phabricator.wikimedia.org/P37268 and previous config saved to /var/cache/conftool/dbconfig/20221031-194614-ladsgroup.json [19:46:23] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:47:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318605)', diff saved to https://phabricator.wikimedia.org/P37269 and previous config saved to /var/cache/conftool/dbconfig/20221031-194738-ladsgroup.json [19:50:29] (03PS1) 10Ottomata: beta - override mediawiki_page_change on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851122 (https://phabricator.wikimedia.org/T311129) [19:51:38] (03CR) 10Ottomata: [C: 03+2] beta - override mediawiki_page_change on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851122 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [19:54:44] (03PS2) 10Andrew Bogott: Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) [19:54:46] (03PS1) 10Andrew 
Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 [19:54:48] (03PS1) 10Andrew Bogott: Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) [19:55:32] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools visual enhancements beta feature at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) [19:55:40] (03CR) 10CI reject: [V: 04-1] Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [19:55:44] (03CR) 10CI reject: [V: 04-1] Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 (owner: 10Andrew Bogott) [19:56:09] (03CR) 10CI reject: [V: 04-1] Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [19:57:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:58:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:58:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:59:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221031T2000). nyaa~ [20:00:05] arlolra, cjming, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:19] hi all - happy to deploy since i got one in the queue [20:00:28] sounds good! [20:00:41] \o [20:00:51] (03CR) 10BCornwall: [C: 04-1] "Thanks for that! I noticed a few places where the conversion changed the indentation (because of mixed tabs/spaces)." [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [20:00:52] I am unavailable so please do :) [20:00:58] (03PS3) 10Bartosz Dziewoński: Enable DiscussionTools mobile visual enhancements at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318870) [20:01:11] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools visual enhancements beta feature at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) [20:01:24] arlolra: are you around? [20:01:29] yes [20:01:35] (03PS4) 10Clare Ming: Disable wgParserEnableLegacyMediaDOM on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [20:01:45] arlolra: great - starting with yours [20:01:53] thank you [20:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:02:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [20:02:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37270 and previous config saved to /var/cache/conftool/dbconfig/20221031-200245-ladsgroup.json [20:03:02] (03PS2) 10Andrew 
Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 [20:03:04] (03PS3) 10Andrew Bogott: Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) [20:03:06] (03PS2) 10Andrew Bogott: Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) [20:03:20] (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [20:03:36] !log cjming@deploy1002 Started scap: Backport for [[gerrit:844073|Disable wgParserEnableLegacyMediaDOM on itwiki (T314318)]] [20:03:45] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [20:03:56] !log cjming@deploy1002 cjming and arlolra: Backport for [[gerrit:844073|Disable wgParserEnableLegacyMediaDOM on itwiki (T314318)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:04:05] (03CR) 10CI reject: [V: 04-1] Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 (owner: 10Andrew Bogott) [20:04:08] (03CR) 10CI reject: [V: 04-1] Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [20:04:19] arlolra: up on any debug server if your patch is verifiable [20:04:28] ok, testing [20:04:32] (03CR) 10CI reject: [V: 04-1] Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [20:05:13] cjming: looks good [20:05:20] cool - going live [20:05:43] doing my patch here next [20:08:39] (03PS3) 10Andrew 
Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 [20:08:41] (03PS4) 10Andrew Bogott: Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) [20:08:43] (03PS3) 10Andrew Bogott: Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) [20:08:45] (03PS1) 10Andrew Bogott: wmf spec tests: only test on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/851126 [20:09:20] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:844073|Disable wgParserEnableLegacyMediaDOM on itwiki (T314318)]] (duration: 05m 44s) [20:09:28] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [20:09:34] (03CR) 10CI reject: [V: 04-1] Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 (owner: 10Andrew Bogott) [20:09:45] (03CR) 10BryanDavis: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37869/" [puppet] - 10https://gerrit.wikimedia.org/r/851121 (https://phabricator.wikimedia.org/T285403) (owner: 10BryanDavis) [20:09:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [20:09:55] (03CR) 10CI reject: [V: 04-1] Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [20:10:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:10:12] arlolra: should be live! [20:10:29] (03PS4) 10Clare Ming: Update sample rate for edit attempt stream to 1 for group 0. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) [20:10:41] (03CR) 10CI reject: [V: 04-1] Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [20:11:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:11:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:11:10] (03PS1) 10Ottomata: Declare rc0.mediawiki.page_change and enable it only in beta wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851127 (https://phabricator.wikimedia.org/T311129) [20:11:32] (03CR) 10TrainBranchBot: "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [20:11:48] cjming: thank you [20:11:56] np! [20:12:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:12:24] (03CR) 10CI reject: [V: 04-1] Declare rc0.mediawiki.page_change and enable it only in beta wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851127 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [20:12:28] (03Merged) 10jenkins-bot: Update sample rate for edit attempt stream to 1 for group 0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851109 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [20:12:44] !log cjming@deploy1002 Started scap: Backport for [[gerrit:851109|Update sample rate for edit attempt stream to 1 for group 0. (T312016)]] [20:12:50] T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016 [20:13:03] !log cjming@deploy1002 cjming and cjming: Backport for [[gerrit:851109|Update sample rate for edit attempt stream to 1 for group 0. 
(T312016)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:13:47] (03PS2) 10Ottomata: Declare rc0.mediawiki.page_change and enable it only in beta wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851127 (https://phabricator.wikimedia.org/T311129) [20:14:21] (03PS3) 10Ottomata: Declare rc0.mediawiki.page_change and enable it only in beta wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851127 (https://phabricator.wikimedia.org/T311129) [20:14:41] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2022-10-27-235113-production [puppet] - 10https://gerrit.wikimedia.org/r/851121 (https://phabricator.wikimedia.org/T285403) (owner: 10BryanDavis) [20:16:26] (03PS3) 10Zabe: varnish: Fix identation [puppet] - 10https://gerrit.wikimedia.org/r/829319 [20:17:01] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851109|Update sample rate for edit attempt stream to 1 for group 0. 
(T312016)]] (duration: 04m 17s) [20:17:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:17:32] (03CR) 10Zabe: varnish: Fix identation (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [20:17:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37271 and previous config saved to /var/cache/conftool/dbconfig/20221031-201751-ladsgroup.json [20:18:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:18:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:18:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843521 (https://phabricator.wikimedia.org/T262630) (owner: 10Ebernhardson) [20:19:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:19:11] (03PS2) 10Andrew Bogott: wmf spec tests: Update to test Bullseye/Xena [puppet] - 10https://gerrit.wikimedia.org/r/851126 [20:19:13] (03PS4) 10Andrew Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 [20:19:15] (03PS5) 10Andrew Bogott: Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) [20:19:17] (03PS4) 10Andrew Bogott: Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) [20:19:55] (03CR) 10CI reject: [V: 04-1] wmf spec tests: Update to test Bullseye/Xena [puppet] - 10https://gerrit.wikimedia.org/r/851126 (owner: 10Andrew Bogott) [20:20:08] (03Merged) 10jenkins-bot: cirrus: Correct comments in ProductionServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843521 
(https://phabricator.wikimedia.org/T262630) (owner: 10Ebernhardson) [20:20:16] (03CR) 10CI reject: [V: 04-1] Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 (owner: 10Andrew Bogott) [20:20:20] !log cjming@deploy1002 Started scap: Backport for [[gerrit:843521|cirrus: Correct comments in ProductionServices.php (T262630)]] [20:20:27] T262630: ProductionServices.php has cloudelastic-{psi,omega}-eqiad ports mixed up - https://phabricator.wikimedia.org/T262630 [20:20:39] !log cjming@deploy1002 cjming and ebernhardson: Backport for [[gerrit:843521|cirrus: Correct comments in ProductionServices.php (T262630)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:20:45] (03CR) 10jenkins-bot: Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [20:20:53] ebernhardson: your patch is up on the debug servers if you'd like to check [20:21:12] unless it's no-op [20:21:26] (03CR) 10jenkins-bot: Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [20:21:51] cjming: all looks reasonable, it's a noop but it did reorder some config. 
But that doesn't seem to have changed anything [20:21:59] great - syncing [20:22:13] (03CR) 10Ottomata: [C: 03+2] Declare rc0.mediawiki.page_change and enable it only in beta wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851127 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [20:22:54] (03PS3) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [20:23:05] (03Merged) 10jenkins-bot: Declare rc0.mediawiki.page_change and enable it only in beta wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851127 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [20:24:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:24:58] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [20:24:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:24:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:25:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:26:18] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:843521|cirrus: Correct comments in ProductionServices.php (T262630)]] (duration: 05m 57s) [20:26:25] T262630: ProductionServices.php has cloudelastic-{psi,omega}-eqiad ports mixed up - https://phabricator.wikimedia.org/T262630 [20:26:45] ebernhardson: should be live! [20:28:13] cjming: thanks! [20:28:21] np! 
[20:31:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:31:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:31:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:32:42] !log end of UTC late backport window [20:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:32:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318605)', diff saved to https://phabricator.wikimedia.org/P37272 and previous config saved to /var/cache/conftool/dbconfig/20221031-203258-ladsgroup.json [20:33:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [20:33:05] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:33:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [20:33:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T318605)', diff saved to https://phabricator.wikimedia.org/P37273 and previous config saved to /var/cache/conftool/dbconfig/20221031-203319-ladsgroup.json [20:33:46] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: No-op sync of InitialiseSettings.php to declare stream rc0.mediawiki.page_change. This stream is disabled everywhere by default, and only enabled in beta for now. 
- T311129 (duration: 03m 42s) [20:33:52] T311129: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 [20:37:46] (03PS4) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [20:39:19] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:39:41] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [20:41:30] (03PS1) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) [20:41:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T318605)', diff saved to https://phabricator.wikimedia.org/P37274 and previous config saved to /var/cache/conftool/dbconfig/20221031-204157-ladsgroup.json [20:42:08] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:43:36] (03CR) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [20:44:22] (03PS5) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [20:46:21] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [20:50:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:09] (03PS4) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) [20:53:26] (03CR) 10BCornwall: prometheus: Add ats header/body size total metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: 10BCornwall) [20:57:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37275 and previous config saved to /var/cache/conftool/dbconfig/20221031-205703-ladsgroup.json [21:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221031T2100). [21:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Ottomata) @BTullis can/should we just remove those nodes as Hadoop workers and reimage them as DSE workers? We can probably due without th... 
[21:06:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:12:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37276 and previous config saved to /var/cache/conftool/dbconfig/20221031-211210-ladsgroup.json [21:20:50] (03PS1) 10Bartosz Dziewoński: Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) [21:27:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T318605)', diff saved to https://phabricator.wikimedia.org/P37277 and previous config saved to /var/cache/conftool/dbconfig/20221031-212717-ladsgroup.json [21:27:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [21:27:22] PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:27:24] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:27:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [21:27:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T318605)', diff saved to https://phabricator.wikimedia.org/P37278 and previous config saved to /var/cache/conftool/dbconfig/20221031-212749-ladsgroup.json [21:29:46] 10ops-eqiad, 10DC-Ops: Q1:rack/setup/install elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) [21:31:02] 10ops-eqiad, 10DC-Ops: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) 
[21:31:14] 10ops-eqiad, 10DC-Ops: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) [21:33:05] 10ops-eqiad, 10DC-Ops: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) [21:36:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T318605)', diff saved to https://phabricator.wikimedia.org/P37279 and previous config saved to /var/cache/conftool/dbconfig/20221031-213632-ladsgroup.json [21:36:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:37:02] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) [21:51:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37280 and previous config saved to /var/cache/conftool/dbconfig/20221031-215138-ladsgroup.json [21:52:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) [21:52:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) [21:56:35] (03PS6) 10Andrew Bogott: Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) [21:56:37] (03PS5) 10Andrew Bogott: Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) [21:56:39] (03PS3) 10Andrew Bogott: wmf spec tests: Update to test Bullseye/Xena [puppet] - 10https://gerrit.wikimedia.org/r/851126 [21:56:41] (03PS5) 10Andrew Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/851123 [21:58:09] (03CR) 10CI reject: [V: 04-1] wmf spec tests: Update to 
test Bullseye/Xena [puppet] - https://gerrit.wikimedia.org/r/851126 (owner: Andrew Bogott) [21:58:18] (PS1) BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [21:58:56] (CR) CI reject: [V: -1] Remove obsolete files for OpenStack version Victoria [puppet] - https://gerrit.wikimedia.org/r/851123 (owner: Andrew Bogott) [22:06:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37281 and previous config saved to /var/cache/conftool/dbconfig/20221031-220645-ladsgroup.json [22:07:06] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:07:14] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:12:04] (PS2) BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [22:15:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:11] (PS4) Andrew Bogott: wmf spec tests: Update to test Bullseye/Xena [puppet] - https://gerrit.wikimedia.org/r/851126 [22:16:13] (PS6) Andrew Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - https://gerrit.wikimedia.org/r/851123 [22:16:50] (CR) Andrew Bogott: "This
test won't work until there are bullseye test runners." [puppet] - https://gerrit.wikimedia.org/r/851126 (owner: Andrew Bogott) [22:16:55] (CR) CI reject: [V: -1] wmf spec tests: Update to test Bullseye/Xena [puppet] - https://gerrit.wikimedia.org/r/851126 (owner: Andrew Bogott) [22:17:23] (CR) CI reject: [V: -1] Remove obsolete files for OpenStack version Victoria [puppet] - https://gerrit.wikimedia.org/r/851123 (owner: Andrew Bogott) [22:18:33] (PS1) Andrew Bogott: codfw1dev designate -> version 'yoga' [puppet] - https://gerrit.wikimedia.org/r/851142 (https://phabricator.wikimedia.org/T305828) [22:18:58] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 143 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:20:58] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:21:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:48] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [22:21:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T318605)', diff saved to https://phabricator.wikimedia.org/P37282 and previous config saved to /var/cache/conftool/dbconfig/20221031-222151-ladsgroup.json [22:21:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:22:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:22:35] (03CR) 10Cwhite: [C: 03+1] smokeping: remove ancillary data [puppet] - 10https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [22:22:47] (03CR) 10Cwhite: [C: 03+1] profile: absent smokeping [puppet] - 10https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [22:22:59] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:23:20] (03PS6) 10Andrew Bogott: Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) [22:23:22] (03PS2) 10Andrew Bogott: codfw1dev designate -> version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851142 (https://phabricator.wikimedia.org/T305828) [22:23:24] (03PS1) 10Andrew Bogott: Add 
openstack::serverpackages::yoga::bullseye [puppet] - 10https://gerrit.wikimedia.org/r/851143 (https://phabricator.wikimedia.org/T305828) [22:26:21] (03CR) 10Andrew Bogott: [C: 03+2] Add Openstack files and templates for version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851120 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [22:26:31] (03CR) 10Andrew Bogott: [C: 03+2] Add openstack::serverpackages::yoga::bullseye [puppet] - 10https://gerrit.wikimedia.org/r/851143 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [22:26:43] (03CR) 10Andrew Bogott: [C: 03+2] Add openstack::designate::service::yoga [puppet] - 10https://gerrit.wikimedia.org/r/851124 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [22:27:58] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:28:01] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev designate -> version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851142 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [22:28:14] RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:30:48] (03CR) 10Cwhite: smokeping: add ensure parameter, set to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [22:31:03] (03CR) 10Cwhite: [C: 03+1] smokeping: remove module and profile [puppet] - 10https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [22:31:44] (03CR) 10Cwhite: [C: 03+1] hieradata: don't monitor /run/docker on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/850993 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [22:35:55] (LogstashKafkaConsumerLag) firing: Too 
many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:36:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [22:40:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:41:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [22:43:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:45:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:16] (03PS1) 10Bartosz Dziewoński: Clean up wgDiscussionToolsABTest config for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851147 [22:52:31] 
(03PS2) 10Bartosz Dziewoński: Clean up wgDiscussionToolsABTest config for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851147 [22:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:57:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:58:37] (03PS1) 10Andrew Bogott: pdns-recursor: remove delegation-only config setting [puppet] - 10https://gerrit.wikimedia.org/r/851148 [22:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:00:05] (03CR) 10Andrew Bogott: "This change is premised on the idea that there was no compelling reason to set this in the first place... I'm open to learning otherwise." 
[puppet] - 10https://gerrit.wikimedia.org/r/851148 (owner: 10Andrew Bogott) [23:03:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:27:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [23:28:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:28:52] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:30:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:36:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:54:56] (03PS30) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 
https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [23:56:52] (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [23:58:23] (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (8 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)