[00:22:29] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:28:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:29:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:34:23] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:38:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/959980
[00:38:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/959980 (owner: TrainBranchBot)
[00:39:01] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[00:42:13] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:21] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:49:09] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[00:49:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:53:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/959980 (owner: TrainBranchBot)
[01:19:07] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[01:31:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:37] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[01:48:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:50:49] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[01:51:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:57:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:58:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:07:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:08:44] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:23:44] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:35:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:38:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:38:44] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:40:45] <wikibugs>	 (PS2) DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951)
[02:43:15] <wikibugs>	 SRE, RESTBase, RESTBase-API, Traffic: REST API not returning latest page when queried title is a redirect - https://phabricator.wikimedia.org/T346579 (Brycehughes) Ah ok. Thanks for checking. I suppose this can just sit open for a bit. I have a workaround, it just involves me hitting the API 2-3...
[02:48:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:49:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:52:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:58:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:03:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:26:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:46:07] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:54:49] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:07:49] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:32:13] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:42:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:44:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:45:04] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2023-09-13-074325-production [deployment-charts] - https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045)
[04:45:36] * kart_ updating cxserver. Minor changes.
[04:49:10] <kart_>	 OK. I'll hold this till tomorrow. Are mesh changes OK to deploy (seems already deployed in staging) godog ?
[04:52:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[04:55:51] <_joe_>	 kart_: you really need to read ops@; there was an email from Janis explaining those changes are safe to deploy
[04:56:41] <_joe_>	 kart_: that's where we announce such changes; we're ofc open to suggestions on how to make such communications stand out
[04:57:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[04:58:34] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266)
[04:58:36] <kart_>	 My bad. It was a month back and I also conveyed that to team :/
[04:58:57] <kart_>	 _joe_: sorry for noise.
[04:59:09] <_joe_>	 kart_: np :P
[04:59:43] <_joe_>	 I didn't realize it was almost a month ago, sheesh
[05:00:09] <kart_>	 That means we've not deployed cxserver since then :)
[05:00:58] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266)
[05:01:12] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-09-13-074325-production [deployment-charts] - https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045) (owner: KartikMistry)
[05:02:21] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-09-13-074325-production [deployment-charts] - https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045) (owner: KartikMistry)
[05:08:05] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:08:31] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:12:30] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:13:03] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:22:28] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:22:56] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:23:23] <kart_>	 !log Updated cxserver to 2023-09-13-074325-production (T346045)
[05:23:44] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:24:22] <kart_>	 !log Updated cxserver to 2023-09-13-074325-production (T346045)
[05:24:33] <kart_>	 hmm?
[05:47:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:49:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:52:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:53:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:54:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:54:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:04:17] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:17] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:14:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 35008
[06:14:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 35008
[06:28:35] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (208821s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:38:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:46:43] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:46:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:46:53] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:48:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:48:19] <wikibugs>	 (CR) Ayounsi: [C: +2] Block inbound RAs on the routers [homer/public] - https://gerrit.wikimedia.org/r/959732 (https://phabricator.wikimedia.org/T334916) (owner: Ayounsi)
[06:48:41] <wikibugs>	 (CR) Muehlenhoff: [C: +1] mcrouter: Specify missing CXXFLAGS (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/860584 (owner: TK-999)
[06:48:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:49:39] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 2.959 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:21] <wikibugs>	 (Merged) jenkins-bot: Block inbound RAs on the routers [homer/public] - https://gerrit.wikimedia.org/r/959732 (https://phabricator.wikimedia.org/T334916) (owner: Ayounsi)
[06:50:53] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:55:50] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: drop dmz_cidr excemptions for cloudmetrics1003/4 [puppet] - https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[06:55:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] firewall: Default provider to none [puppet] - https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:56:24] <moritzm>	 taavi: I'll merge your change along, ok?
[06:56:27] <taavi>	 yes please
[06:56:53] <moritzm>	 ack, merged now
[07:00:06] <wikibugs>	 (CR) Muehlenhoff: [C: +2] LVS: Set profile::firewall::provider: none [puppet] - https://gerrit.wikimedia.org/r/959954 (owner: Muehlenhoff)
[07:00:06] <jouncebot>	 Amir1, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T0700).
[07:00:06] <jouncebot>	 Sohom_Datta: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:39] <wikibugs>	 (PS3) Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/959179
[07:00:47] <Sohom_Datta>	 o/
[07:02:23] <taavi>	 o/ I can deploy
[07:03:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [extensions/PageTriage] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/959986 (https://phabricator.wikimedia.org/T345496) (owner: Sohom Datta)
[07:04:25] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959179 (owner: Muehlenhoff)
[07:06:10] <XioNoX>	 !log roll out "Block inbound RAs on the routers" - T334916
[07:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:17] <stashbot>	 T334916: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916
[07:07:45] <wikibugs>	 (PS4) Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025)
[07:07:47] <wikibugs>	 (PS2) Giuseppe Lavagetto: thumbor: use base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/959948 (https://phabricator.wikimedia.org/T343025)
[07:07:49] <wikibugs>	 (PS1) Giuseppe Lavagetto: apertium: use base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025)
[07:08:45] <wikibugs>	 (CR) CI reject: [V: -1] apertium: use base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025) (owner: Giuseppe Lavagetto)
[07:15:28] <wikibugs>	 (Merged) jenkins-bot: Make sure different key values are handled while submitting [extensions/PageTriage] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/959986 (https://phabricator.wikimedia.org/T345496) (owner: Sohom Datta)
[07:16:30] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:959986|Make sure different key values are handled while submitting (T345496)]]
[07:16:39] <stashbot>	 T345496: If a user tries to place two of the same tag, should show a warning or silently delete one tag - https://phabricator.wikimedia.org/T345496
[07:20:27] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:20:34] <wikibugs>	 (CR) Muehlenhoff: puppetdb: preseed to avoid creating database users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[07:22:21] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:26:52] <wikibugs>	 (PS1) Elukey: Add nodejs 18 images on Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/960544
[07:27:46] <wikibugs>	 (PS2) Elukey: Add nodejs 18 images on Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/960544
[07:29:25] <logmsgbot>	 !log taavi@deploy2002 taavi and soda: Backport for [[gerrit:959986|Make sure different key values are handled while submitting (T345496)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:29:29] <taavi>	 finally
[07:29:32] <taavi>	 Sohom_Datta: ^ please test
[07:29:32] <stashbot>	 T345496: If a user tries to place two of the same tag, should show a warning or silently delete one tag - https://phabricator.wikimedia.org/T345496
[07:31:05] <Sohom_Datta>	 On it :)
[07:35:06] <wikibugs>	 (PS1) Urbanecm: growth: Enable section-image recommendations on 10 new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/960545 (https://phabricator.wikimedia.org/T345940)
[07:35:52] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] k8s: Fix dependencies for resources requiring kube user [puppet] - https://gerrit.wikimedia.org/r/959722 (owner: JMeybohm)
[07:36:23] <wikibugs>	 (CR) Majavah: [C: +2] cr-cloud: Drop cloudmetrics excemptions [homer/public] - https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[07:36:26] <MPGuy2824>	 Sohom_Datta looks ok to me : https://en.wikipedia.org/wiki/Emile_van_Rouveroy_van_Nieuwaal
[07:37:02] <wikibugs>	 (Merged) jenkins-bot: cr-cloud: Drop cloudmetrics excemptions [homer/public] - https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[07:37:22] <XioNoX>	 !log update eqsin-ulsfo transport link ospf metrics to match the new latency of 175ms
[07:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:45] <Sohom_Datta>	 taavi: Looks good :) per MPGuy
[07:37:57] <taavi>	 thanks, syncing
[07:37:59] <logmsgbot>	 !log taavi@deploy2002 taavi and soda: Continuing with sync
[07:38:43] <wikibugs>	 (CR) JMeybohm: [V: +1] prometheus::k8s: Discover calico-felix targets from k8s api (2 comments) [puppet] - https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: JMeybohm)
[07:40:17] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [docker-images/production-images] - https://gerrit.wikimedia.org/r/960544 (owner: Elukey)
[07:40:19] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: JMeybohm)
[07:44:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:46:42] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] Add the configuration for the new wikikube hosts in eqiad [homer/public] - https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) (owner: Giuseppe Lavagetto)
[07:47:15] <wikibugs>	 (Merged) jenkins-bot: Add the configuration for the new wikikube hosts in eqiad [homer/public] - https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) (owner: Giuseppe Lavagetto)
[07:47:26] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:959986|Make sure different key values are handled while submitting (T345496)]] (duration: 30m 55s)
[07:47:33] <stashbot>	 T345496: If a user tries to place two of the same tag, should show a warning or silently delete one tag - https://phabricator.wikimedia.org/T345496
[07:47:40] <taavi>	 Sohom_Datta: MPGuy2824: your change is now live
[07:48:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:49:08] <taavi>	 !log drop cloudmetrics exceptions from cr firewall ACLs https://gerrit.wikimedia.org/r/c/operations/homer/public/+/960027 T326266
[07:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] <stashbot>	 T326266: Remove the WMCS statsd/Graphite service - https://phabricator.wikimedia.org/T326266
[07:49:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:49:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:50:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:50:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:51:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:53:01] <MPGuy2824>	 taavi danke
[07:53:15] <Sohom_Datta>	 Thank you :)
[07:58:34] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:58:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:00:04] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:01:13] <jayme>	 !log cordoning kubernetes2010
[08:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:07:59] <effie>	 jouncebot: next
[08:07:59] <jouncebot>	 In 1 hour(s) and 52 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1000)
[08:08:04] <effie>	 jouncebot: nw
[08:08:07] <effie>	 jouncebot: now
[08:08:08] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 51 minute(s)
[08:11:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10ayounsi) 05Open→03Resolved a:03ayounsi Deployed
[08:14:19] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff)
[08:18:09] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] thanos: remove thanos components from thanos::frontend role [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[08:18:53] <jinxer-wm>	 (PuppetDisabled) firing: (2) Puppet disabled on puppetdb1002:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[08:19:03] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:21:42] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto)
[08:22:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Mark mediawiki-testers as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/960546 (https://phabricator.wikimedia.org/T276465)
[08:22:55] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto)
[08:24:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Mark pentesters as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/960547 (https://phabricator.wikimedia.org/T276465)
[08:27:44] <jayme>	 !log draining kubernetes2010.codfw.wmnet - T347267
[08:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:51] <stashbot>	 T347267: kubernetes2010 down - https://phabricator.wikimedia.org/T347267
[08:28:18] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add nodejs 18 images on Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/960544 (owner: 10Elukey)
[08:28:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:30:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:30:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove traceback-roots [puppet] - 10https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465)
[08:31:20] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10MoritzMuehlenhoff)
[08:39:03] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: host is down
[08:39:19] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: host is down
[08:43:16] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.*
[08:43:44] <jayme>	 !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.* - T347267
[08:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:51] <stashbot>	 T347267: kubernetes2010 down - https://phabricator.wikimedia.org/T347267
[08:44:03] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10JMeybohm) Hey DC-Ops, could you please check on kubernetes2010
[08:45:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:46:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:48:08] <wikibugs>	 (03PS1) 10Effie Mouzeli: site.pp: fix typo for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/960549
[08:48:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] site.pp: fix typo for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/960549 (owner: 10Effie Mouzeli)
[08:48:59] <wikibugs>	 (03PS3) 10Elukey: profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696)
[08:49:05] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] site.pp: fix typo for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/960549 (owner: 10Effie Mouzeli)
[08:53:52] <Amir1>	 vgutierrez: hey, from traffic side: Is it fine to just merge this patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/959762
[08:54:10] <Amir1>	 or should I do some dance like disabling puppet on lvs or etc
[08:54:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:54:42] <elukey>	 Amir1: in theory we should just run puppet on the cp nodes, or let it run and see traffic gradually migrate
[08:55:12] <wikibugs>	 (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871)
[08:55:16] <elukey>	 I don't see anything ongoing for traffic on sal
[08:55:36] <Amir1>	 yeah
[08:55:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:56:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 16 hosts with reason: Schema change
[08:56:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 16 hosts with reason: Schema change
[08:57:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[08:57:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[08:57:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:57:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 15 hosts with reason: Maintenance
[08:57:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 15 hosts with reason: Maintenance
[08:57:42] <wikibugs>	 (03PS4) 10Ladsgroup: profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey)
[08:57:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:57:48] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey)
[08:58:52] <elukey>	 !log migrate ores.wikimedia.org's ATS backend to ores-legacy.discovery.wmnet (k8s app) - This will drain traffic to ORES bare metal nodes - T341696
[08:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:59] <stashbot>	 T341696: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696
[08:59:01] <elukey>	 Amir1: logged --^
[08:59:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:59:26] <Amir1>	 !log by the power vested in me by Chris Albon and ML team, I now pronounce ORES dead.
[08:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:42] <Amir1>	 elukey: That's logging ^ :P
[08:59:52] <elukey>	 :D :D :D :D
[09:00:13] <jynus>	 oh, wow
[09:00:19] <jynus>	 is that true?
[09:01:03] <Amir1>	 the bare metal infra is depooled now, calls still go to lift wing via an adapter service
[09:01:27] <Amir1>	 but the plug on that will also be pulled eventually
[09:01:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10ayounsi) This might need to be rolled back the day we start doing BGP unnumbered between spine and leaf as it seems to rely on it: https://www.theasciiconstruct.com/post/junos-b...
[09:01:31] <elukey>	 jynus: we have https://ores-legacy.wikimedia.org/ that is on k8s, and it calls lift wing behind the scenes
[09:01:50] <wikibugs>	 (03CR) 10Mabualruz: [C: 03+1] Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[09:02:11] <elukey>	 jynus: the goal is also to deprecate ores-legacy once everybody is on Lift Wing
[09:02:25] <elukey>	 (we do it in two steps to drain ores bare metal and decom those servers)
[09:03:23] <Amir1>	 if you need help decommissioning those servers, you know who you gonna call elukey ?
[09:04:05] <Amir1>	 the part I'm happy about is that mw support of ores already switched to LW directly without even needing to go through adapter 
[09:04:13] <elukey>	 Amir1: you can remove the "if you need" part, you have to do it with me and Tobias :)
[09:04:39] <Amir1>	 awesome. Just drop me the tickets :D
[09:05:03] <elukey>	 yep! Now I'll keep watching the ores-legacy dashboard for troubles, I know some will arise
[09:06:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[09:06:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[09:06:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 13 hosts with reason: Maintenance
[09:06:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 13 hosts with reason: Maintenance
[09:08:34] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: thumbor: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959948 (https://phabricator.wikimedia.org/T343025)
[09:08:36] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: apertium: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025)
[09:08:52] <icinga-wm>	 RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[09:09:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] apertium: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto)
[09:11:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[09:12:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[09:12:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 14 hosts with reason: Maintenance
[09:12:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 14 hosts with reason: Maintenance
[09:18:27] <jinxer-wm>	 (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[09:18:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[09:18:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 14 hosts with reason: Maintenance
[09:19:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 14 hosts with reason: Maintenance
[09:19:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye
[09:19:54] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye
[09:20:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1032.eqiad.wmnet with OS bullseye
[09:20:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (6) kubernetes1028.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:20:47] <wikibugs>	 (03CR) 10Btullis: "Looks good in general. Thanks brouberol. I left one genuine question about paths." [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[09:22:35] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto)
[09:22:47] <wikibugs>	 (03CR) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[09:23:02] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/959693 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto)
[09:23:07] <wikibugs>	 (03PS2) 10Jelto: switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/959693 (https://phabricator.wikimedia.org/T345618)
[09:23:27] <jinxer-wm>	 (PrometheusRuleEvaluationFailures) resolved: (8) Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[09:23:44] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:24:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[09:24:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[09:24:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 17 hosts with reason: Maintenance
[09:24:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 17 hosts with reason: Maintenance
[09:25:01] <wikibugs>	 (03PS12) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741)
[09:25:34] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (6) kubernetes1028.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:26:13] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney)
[09:26:27] <wikibugs>	 (03CR) 10Brouberol: "@" [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[09:27:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:28:49] <vgutierrez>	 Amir1: sorry...day off here. Nope, just merge it and puppet will take care
[09:29:01] <Amir1>	 awesome. thanks.
[09:29:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43508/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite)
[09:30:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:30:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:30:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db[1137,1216,1220,1225].eqiad.wmnet,dbstore1005.eqiad.wmnet with reason: Maintenance
[09:30:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[1137,1216,1220,1225].eqiad.wmnet,dbstore1005.eqiad.wmnet with reason: Maintenance
[09:33:34] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage
[09:33:52] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/959693 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto)
[09:33:59] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage
[09:34:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage
[09:34:17] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto)
[09:36:45] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage
[09:37:16] <wikibugs>	 (03CR) 10Ayounsi: Support configuration of EVPN anycast GW on switches (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[09:38:36] <jelto>	 !log switch people.wikimedia.org to codfw - T345618
[09:38:39] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage
[09:38:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:42] <stashbot>	 T345618: Switchover people.wikimedia.org - September 2023 - https://phabricator.wikimedia.org/T345618
[09:41:21] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage
[09:43:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[09:43:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[09:43:27] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=73525cca-1535-4d44-89d8-fcd584ea67a9) set by jmm@cumin2002 for...
[09:43:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[09:43:33] <wikibugs>	 (03CR) 10Volans: "Thanks for migrating to the batch classes, the approach looks sane, few suggestions inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[09:43:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[09:43:53] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=69921077-8a56-48de-9905-0d3d1b91d292) set by jmm@cumin2002 for...
[09:45:31] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Jelto)
[09:45:44] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "With tox v3, I have confirmed there is no change in the configuration (using `tox --showconf`) and all environments have `usedevelop`:" [software/conftool] - 10https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238) (owner: 10Hashar)
[09:46:33] <wikibugs>	 (03PS1) 10Urbanecm: AddImageFeedbackHandler: Add missing parameters [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959993 (https://phabricator.wikimedia.org/T346277)
[09:47:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance
[09:47:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance
[09:47:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es[1021-1022].eqiad.wmnet with reason: Maintenance
[09:47:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1021-1022].eqiad.wmnet with reason: Maintenance
[09:49:27] <jinxer-wm>	 (PrometheusRuleEvaluationFailures) firing: (4) Prometheus rule evaluation failures (instance prometheus1006:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[09:50:33] <urbanecm>	 jouncebot: nowandnext
[09:50:33] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 9 minute(s)
[09:50:33] <jouncebot>	 In 0 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1000)
[09:51:28] <wikibugs>	 (03PS2) 10JMeybohm: prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915)
[09:52:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance
[09:52:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance
[09:52:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1022 (T344589)', diff saved to https://phabricator.wikimedia.org/P52596 and previous config saved to /var/cache/conftool/dbconfig/20230925-095235-ladsgroup.json
[09:53:41] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1030.eqiad.wmnet with OS bullseye
[09:53:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:54:04] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm)
[09:54:27] <jinxer-wm>	 (PrometheusRuleEvaluationFailures) resolved: (7) Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[09:55:54] <wikibugs>	 (03PS1) 10Mhorsey: Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065)
[09:56:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:56:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) (owner: 10Mhorsey)
[09:56:57] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1032.eqiad.wmnet with OS bullseye
[09:58:38] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.3 - https://phabricator.wikimedia.org/T316421 (10LSobanski)
[09:59:36] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.3 - https://phabricator.wikimedia.org/T316421 (10LSobanski) I updated the description to reflect the new Etherpad release (1.9.2). See below for a list of changes:  * Compatibility changes ** express-rate-limit has be...
[09:59:40] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1031.eqiad.wmnet with OS bullseye
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1000)
[10:00:07] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/959981
[10:01:59] <wikibugs>	 (03PS2) 10Mhorsey: Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065)
[10:03:41] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1033.eqiad.wmnet with OS bullseye
[10:03:52] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye
[10:03:59] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1035.eqiad.wmnet with OS bullseye
[10:04:05] <wikibugs>	 (03PS13) 10Brouberol: [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741)
[10:04:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:04:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1037.eqiad.wmnet with OS bullseye
[10:04:23] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:04:31] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:04:35] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1039.eqiad.wmnet with OS bullseye
[10:04:45] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1040.eqiad.wmnet with OS bullseye
[10:04:46] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:04:49] <wikibugs>	 (03CR) 10Brouberol: "Thanks for the review Volans! I tried to address your remarks, questions and nits!" [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[10:05:04] <wikibugs>	 (03CR) 10Mhorsey: [C: 04-1] "DO NOT MERGE UNTIL RELEASE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) (owner: 10Mhorsey)
[10:05:06] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1041.eqiad.wmnet with OS bullseye
[10:05:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1042.eqiad.wmnet with OS bullseye
[10:06:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: o11y: add some leeway for PrometheusRuleEvaluationFailures [alerts] - 10https://gerrit.wikimedia.org/r/960560 (https://phabricator.wikimedia.org/T347167)
[10:07:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[10:08:10] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:08:18] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:09:27] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:09:35] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:10:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: add some leeway for PrometheusRuleEvaluationFailures [alerts] - 10https://gerrit.wikimedia.org/r/960560 (https://phabricator.wikimedia.org/T347167) (owner: 10Filippo Giunchedi)
[10:12:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: P:prometheus::ops: convert to using wmflib::get_clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:17:34] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage
[10:17:44] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage
[10:17:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage
[10:18:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T344589)', diff saved to https://phabricator.wikimedia.org/P52597 and previous config saved to /var/cache/conftool/dbconfig/20230925-101824-ladsgroup.json
[10:18:25] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage
[10:18:44] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage
[10:18:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage
[10:19:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage
[10:19:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage
[10:20:06] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage
[10:20:07] <wikibugs>	 (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[10:20:07] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage
[10:22:33] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage
[10:23:04] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage
[10:23:38] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1043.eqiad.wmnet with OS bullseye
[10:23:52] <wikibugs>	 (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[10:23:57] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye
[10:24:19] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1044.eqiad.wmnet with OS bullseye
[10:25:00] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage
[10:25:00] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1045.eqiad.wmnet with OS bullseye
[10:25:03] <wikibugs>	 (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[10:25:13] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1046.eqiad.wmnet with OS bullseye
[10:25:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye
[10:25:47] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1048.eqiad.wmnet with OS bullseye
[10:25:55] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1047.eqiad.wmnet with OS bullseye
[10:25:59] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1049.eqiad.wmnet with OS bullseye
[10:26:13] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1050.eqiad.wmnet with OS bullseye
[10:26:26] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1051.eqiad.wmnet with OS bullseye
[10:27:00] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1053.eqiad.wmnet with OS bullseye
[10:27:09] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1054.eqiad.wmnet with OS bullseye
[10:27:12] <wikibugs>	 (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[10:27:20] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye
[10:27:27] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage
[10:27:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1055.eqiad.wmnet with OS bullseye
[10:27:34] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage
[10:27:45] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage
[10:29:01] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney)
[10:29:34] <wikibugs>	 (03Merged) 10jenkins-bot: Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney)
[10:29:55] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "pending deployment date definition" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960545 (https://phabricator.wikimedia.org/T345940) (owner: 10Urbanecm)
[10:31:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye
[10:33:10] <icinga-wm>	 PROBLEM - Host kubernetes1040 is DOWN: PING CRITICAL - Packet loss = 100%
[10:33:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P52599 and previous config saved to /var/cache/conftool/dbconfig/20230925-103330-ladsgroup.json
[10:33:36] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1128: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/959994
[10:34:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:34:21] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1128: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/959994 (owner: 10Marostegui)
[10:34:36] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:34:38] <volans>	 effie: I'm a bit afraid that you're running them too closely between each other and many will fail to downtime and potentially other steps
[10:34:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1128', diff saved to https://phabricator.wikimedia.org/P52600 and previous config saved to /var/cache/conftool/dbconfig/20230925-103454-root.json
[10:35:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1035.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1035.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:35:48] <icinga-wm>	 RECOVERY - Host kubernetes1040 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[10:36:35] <effie>	 volans: I realised it quite late and the hard way
[10:36:53] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1042.eqiad.wmnet with OS bullseye
[10:37:13] <volans>	 each downtime during reimage requires a puppet run on the active alert host and each run take ~2.5 minutes
[10:37:16] <volans>	 *takes
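[editor's note] A back-of-the-envelope sketch of the point volans makes above, and nothing more than that. It assumes the puppet runs on the active alert host are strictly serialized, which the log does not state, and it uses a hypothetical time budget, because the real timeout of the downtime step is not given here. The function names are illustrative and are not part of the cookbooks.

```python
import math

# From the conversation: each puppet run on the alert host takes roughly 2.5 minutes.
PUPPET_RUN_MINUTES = 2.5


def downtime_wait_minutes(position: int, run_minutes: float = PUPPET_RUN_MINUTES) -> float:
    """Minutes until the downtime of the reimage at 1-based `position` is applied.

    Assumption (not confirmed by the log): puppet runs do not overlap, so the
    Nth reimage started in a burst waits for N consecutive runs.
    """
    if position < 1:
        raise ValueError("position is 1-based")
    return position * run_minutes


def reimages_within_budget(budget_minutes: float, run_minutes: float = PUPPET_RUN_MINUTES) -> int:
    """How many closely spaced reimages get their downtime inside the budget.

    `budget_minutes` is a hypothetical value; the real limit is not in the log.
    """
    return math.floor(budget_minutes / run_minutes)


if __name__ == "__main__":
    # Ten reimages started back to back, under the serialization assumption.
    print(downtime_wait_minutes(10))
    # With a hypothetical 10-minute budget for the downtime step.
    print(reimages_within_budget(10))
```

Under these assumptions the wait grows linearly with the number of reimages started together, which is consistent with the later downtime cookbooks in a burst failing while the first few pass.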
[10:37:30] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1040.eqiad.wmnet with OS bullseye
[10:37:38] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage
[10:38:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage
[10:38:50] <effie>	 volans: we will take the alert hit, I will prep a patch to add that info in the help info, as it gets forgotten all the time 
[10:38:54] <effie>	 sorry about that 
[10:38:56] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage
[10:39:07] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:39:09] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage
[10:39:42] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1041.eqiad.wmnet with OS bullseye
[10:39:43] <volans>	 effie: thanks! FYI we'll shortly have locking support so we would be able to avoid some of those failures
[10:40:02] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[10:40:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:40:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage
[10:40:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage
[10:40:25] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage
[10:40:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:40:42] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage
[10:40:46] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1035.eqiad.wmnet with OS bullseye
[10:40:56] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host deploy1002.eqiad.wmnet
[10:41:05] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1033.eqiad.wmnet with OS bullseye
[10:42:03] <wikibugs>	 10SRE-swift-storage: Swift-recon -d overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (10MatthewVernon) 05Open→03Resolved Resolved by deploying `2.26.0-10+deb11u1+wmf1` fleet-wide.
[10:42:25] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage
[10:42:56] <effie>	 volans: <3
[10:42:58] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage
[10:43:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage
[10:43:18] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1039.eqiad.wmnet with OS bullseye
[10:43:24] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage
[10:43:30] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage
[10:43:31] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[10:45:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:45:06] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage
[10:45:22] <wikibugs>	 (03PS1) 10Elukey: icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278)
[10:45:28] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage
[10:45:41] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage
[10:46:35] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1037.eqiad.wmnet with OS bullseye
[10:46:50] <wikibugs>	 10SRE, 10SRE-swift-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 (10MatthewVernon) 05Stalled→03Resolved a:03MatthewVernon We don't use swiftrepl any more, so closing this.
[10:47:37] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage
[10:47:37] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage
[10:47:52] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage
[10:48:10] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage
[10:48:15] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage
[10:48:17] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage
[10:48:37] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1034.eqiad.wmnet with OS bullseye
[10:48:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P52601 and previous config saved to /var/cache/conftool/dbconfig/20230925-104837-ladsgroup.json
[10:49:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:49:09] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage
[10:49:27] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1002.eqiad.wmnet
[10:50:08] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage
[10:50:10] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage
[10:50:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (5) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:50:35] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025)
[10:50:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025)
[10:50:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:51:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto)
[10:51:26] <wikibugs>	 (03CR) 10Volans: [sre.kafka] Use broker in-sync status as a gate between broker restarts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol)
[10:51:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:52:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:52:08] <claime>	 keyholder error expected, fixing
[10:52:12] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes1051 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.28: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:52:20] <icinga-wm>	 PROBLEM - DPKG on kubernetes1049 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.26: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:52:33] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage
[10:52:34] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) > An interesting data point (that I didn't see directly in the other ticket, at least in a quick scan!) would be some idea of the curve of "i...
[10:52:55] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage
[10:53:25] <icinga-wm>	 PROBLEM - Check for large files in client bucket on kubernetes1046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.166: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[10:53:30] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage
[10:53:31] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage
[10:54:21] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1048 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.25: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:54:45] <wikibugs>	 (03PS1) 10Elukey: Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278)
[10:54:49] <icinga-wm>	 PROBLEM - Check for large files in client bucket on kubernetes1051 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.28: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[10:54:55] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1051 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.28: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP
[10:54:56] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter1004.eqiad.wmnet
[10:55:15] <icinga-wm>	 PROBLEM - confd service on kubernetes1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:55:19] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:27] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes1051 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:55:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (7) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:55:49] <icinga-wm>	 RECOVERY - Check for large files in client bucket on kubernetes1051 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[10:55:57] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.166: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:17] <icinga-wm>	 RECOVERY - confd service on kubernetes1044 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:56:23] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:30] <wikibugs>	 10SRE-swift-storage: flip/flop mounting filesystems between systemd and swift-drive-audit - https://phabricator.wikimedia.org/T265450 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Fixed with roll-out of swift version `2.26.0-10+deb11u1+wmf1` fleet-wide.
[10:56:39] <jinxer-wm>	 (KeyholderUnarmed) resolved: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:56:39] <icinga-wm>	 RECOVERY - Check for large files in client bucket on kubernetes1046 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[10:57:01] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1048 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.25: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:11] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes1048 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.25. Check system logs on 10.64.132.25 https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:57:16] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1051.eqiad.wmnet with OS bullseye
[10:57:21] <icinga-wm>	 PROBLEM - Host kubernetes1049 is DOWN: PING CRITICAL - Packet loss = 100%
[10:57:29] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage
[10:57:55] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1055.eqiad.wmnet with OS bullseye
[10:57:59] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:59] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1045.eqiad.wmnet with OS bullseye
[10:58:11] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes1048 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:58:17] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025)
[10:58:19] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025)
[10:58:37] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1004.eqiad.wmnet
[10:58:48] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1043.eqiad.wmnet with OS bullseye
[10:59:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto)
[10:59:07] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:25] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1047 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.29. Check system logs on 10.64.132.29 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:37] <icinga-wm>	 RECOVERY - Host kubernetes1049 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[11:00:19] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:00:27] <wikibugs>	 (PS3) Kamila Součková: geo-maps: reorder codfw/eqiad in the default [dns] - https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474)
[11:00:28] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter1005.eqiad.wmnet
[11:00:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (6) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:01:05] <icinga-wm>	 PROBLEM - Host kubernetes1044 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:05] <icinga-wm>	 PROBLEM - Check for large files in client bucket on kubernetes1056 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.136.27: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[11:01:09] <icinga-wm>	 PROBLEM - Host kubernetes1050 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:51] <icinga-wm>	 PROBLEM - Host kubernetes1048 is DOWN: PING CRITICAL - Packet loss = 100%
[11:02:15] <icinga-wm>	 RECOVERY - Check for large files in client bucket on kubernetes1056 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[11:02:35] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:53] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1049 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:02:53] <icinga-wm>	 RECOVERY - DPKG on kubernetes1049 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:03:09] <icinga-wm>	 PROBLEM - Host kubernetes1046 is DOWN: PING CRITICAL - Packet loss = 100%
[11:03:16] <wikibugs>	 (PS2) Hnowlan: trafficserver: route knowledge-gap path via rest-gateway [puppet] - https://gerrit.wikimedia.org/r/946928 (https://phabricator.wikimedia.org/T342213)
[11:03:16] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1056.eqiad.wmnet with OS bullseye
[11:03:25] <icinga-wm>	 RECOVERY - Host kubernetes1050 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[11:03:40] <wikibugs>	 (PS2) Hnowlan: trafficserver: route requests to mediarequests service [puppet] - https://gerrit.wikimedia.org/r/956909 (https://phabricator.wikimedia.org/T336380)
[11:03:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T344589)', diff saved to https://phabricator.wikimedia.org/P52602 and previous config saved to /var/cache/conftool/dbconfig/20230925-110343-ladsgroup.json
[11:04:01] <icinga-wm>	 RECOVERY - Host kubernetes1044 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[11:04:33] <icinga-wm>	 RECOVERY - Host kubernetes1048 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[11:04:46] <wikibugs>	 (PS2) Hnowlan: api-gateway: emit cache-control header for 404s [deployment-charts] - https://gerrit.wikimedia.org/r/956833 (https://phabricator.wikimedia.org/T336400)
[11:05:13] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1049 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:15] <icinga-wm>	 PROBLEM - Host kubernetes1053 is DOWN: PING CRITICAL - Packet loss = 100%
[11:05:19] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:05:24] <wikibugs>	 SRE, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (LSobanski)
[11:05:25] <icinga-wm>	 RECOVERY - Host kubernetes1046 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:05:30] <wikibugs>	 (PS1) Majavah: hieradata: update ns0.openstack address [puppet] - https://gerrit.wikimedia.org/r/960570
[11:05:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (5) kubernetes1050.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:05:34] <wikibugs>	 SRE, Traffic: Add README and build-specific Dockerfile to purged - https://phabricator.wikimedia.org/T347021 (LSobanski)
[11:05:49] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1048 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:05:52] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1005.eqiad.wmnet
[11:06:01] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1048.eqiad.wmnet with OS bullseye
[11:06:27] <icinga-wm>	 PROBLEM - Host kubernetes1047 is DOWN: PING CRITICAL - Packet loss = 100%
[11:07:23] <icinga-wm>	 RECOVERY - Host kubernetes1053 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:07:29] <wikibugs>	 SRE, SRE-swift-storage, Traffic-Icebox, Wikimedia-Performance-recommendation, affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (MatthewVernon) I can confirm that we're running a new-enough swift version everywhere that we //could//...
[11:07:33] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1044.eqiad.wmnet with OS bullseye
[11:08:25] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1049.eqiad.wmnet with OS bullseye
[11:08:33] <icinga-wm>	 RECOVERY - Host kubernetes1047 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[11:08:35] <icinga-wm>	 PROBLEM - Host kubernetes1054 is DOWN: PING CRITICAL - Packet loss = 100%
[11:08:47] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye
[11:09:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1046.eqiad.wmnet with OS bullseye
[11:09:35] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1050 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:09:35] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1050 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:34] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (5) kubernetes1050.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:10:45] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:10:45] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1047 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:49] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:11:22] <wikibugs>	 (PS1) Hnowlan: rest-gateway: only pass requests for knowledge-gap on wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/960575 (https://phabricator.wikimedia.org/T342213)
[11:11:27] <wikibugs>	 (PS1) Majavah: hieradata: remove more cloudmetrics1003 references [puppet] - https://gerrit.wikimedia.org/r/960576
[11:11:47] <icinga-wm>	 RECOVERY - Host kubernetes1054 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[11:12:31] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1050.eqiad.wmnet with OS bullseye
[11:13:05] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1038.eqiad.wmnet with OS bullseye
[11:13:26] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet
[11:13:27] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1053 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:31] <wikibugs>	 (PS2) Majavah: hieradata: remove more cloudmetrics1003 references [puppet] - https://gerrit.wikimedia.org/r/960576 (https://phabricator.wikimedia.org/T326266)
[11:14:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:16:06] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1047.eqiad.wmnet with OS bullseye
[11:16:12] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1053.eqiad.wmnet with OS bullseye
[11:16:25] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1054 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:47] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1036.eqiad.wmnet with OS bullseye
[11:17:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1054 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:19:25] <wikibugs>	 SRE, ExternalGuidance, Language-Team, Traffic-Icebox: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (Pginer-WMF) Open→Resolved a:Pginer-WMF I think the task can be closed, and focus future efforts in {T280430}  The new Vector skin...
[11:19:58] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1014.eqiad.wmnet
[11:20:17] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1013.eqiad.wmnet
[11:20:22] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1054.eqiad.wmnet with OS bullseye
[11:25:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1013.eqiad.wmnet
[11:26:46] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet
[11:33:02] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet
[11:33:50] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for 30 hosts
[11:34:00] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 30 hosts
[11:35:54] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1052 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:36:24] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:24] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:24] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:24] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:24] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:33] <wikibugs>	 SRE, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF) Thanks for the advice @joe!  > What I fail to understand is how, if this was an open file l...
[11:36:46] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:36:46] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1049 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:36:46] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1050 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:36:46] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1052 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:36:46] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:36:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1054 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:37:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet
[11:38:14] <wikibugs>	 (PS3) JMeybohm: conftool: add new k8s nodes [puppet] - https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) (owner: Giuseppe Lavagetto)
[11:40:53] <wikibugs>	 (CR) JMeybohm: [C: +2] conftool: add new k8s nodes [puppet] - https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) (owner: Giuseppe Lavagetto)
[11:41:01] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] kubernetes::node: Reserve CPU resources for system daemons [puppet] - https://gerrit.wikimedia.org/r/959164 (https://phabricator.wikimedia.org/T277876) (owner: JMeybohm)
[11:41:07] <jinxer-wm>	 (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:41:31] <_joe_>	 uh
[11:41:37] <jynus>	 ^ is this maintenance?
[11:41:40] <jynus>	 I will ack
[11:41:48] <godog>	 here too
[11:41:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:41:53] <godog>	 checking
[11:41:55] <_joe_>	 jynus: not that i know of
[11:42:08] <jynus>	 !incidents
[11:42:12] <_joe_>	 is it eqiad?
[11:42:23] <godog>	 yeah, eqiad and recovering
[11:42:31] <jynus>	 it was acked
[11:42:39] <jynus>	 ip4 eqiad
[11:42:43] <jayme>	 !log running puppet on lvs in codfw - T346714
[11:42:48] <_joe_>	 yeah it's probably due to the rdb server reboot
[11:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:51] <stashbot>	 T346714: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714
[11:43:04] <_joe_>	 jayme: I guess wrong DC?
[11:43:10] <jayme>	 !log running puppet on lvs in eqiad - T346714 (TYPO from above, did not run in codfw)
[11:43:14] <jayme>	 nope :)
[11:43:15] <wikibugs>	 (CR) Ladsgroup: "would it make sense to remove the probe as well? to remove the health checks" [puppet] - https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[11:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:56] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet
[11:44:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:45:05] <godog>	 I'll take a look at the logs
[11:45:51] <claime>	 Yeah docker is probably the rdb reboot
[11:46:07] <jinxer-wm>	 (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:46:15] <jynus>	 resolved, nice
[11:46:26] <claime>	 Yeah, done with that pair of reboots
[11:46:36] <claime>	 I guess it'll fail again when I do the same pair in codfw
[11:47:26] <claime>	 back in a nit
[11:47:28] <claime>	 bit
[11:50:36] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[11:50:37] <jynus>	 correct me if I am wrong, registry should only impact mw image rebuilds and things like that, no direct user impact?
[11:50:50] <jynus>	 e.g. deploys, right?
[11:51:01] <wikibugs>	 (PS1) JMeybohm: mw-api-ext/mw-web: Raise main replicas to 16 [deployment-charts] - https://gerrit.wikimedia.org/r/960591
[11:52:15] <wikibugs>	 (CR) Filippo Giunchedi: icinga/nagios: remove check_ores* (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[11:52:40] <jayme>	 jynus: it's accessible from external as well, so one could say there might be user impact
[11:52:51] <jynus>	 I see, thanks
[11:54:08] <jynus>	 not complaining about it alerting, just tried to assess impact - sometimes some dependencies create greater fallout than initially expected
[11:56:26] <claime>	 jynus: for context, apparently docker-registry doesn't do redis ha
[11:56:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/960576 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[11:57:08] <wikibugs>	 (CR) JMeybohm: [C: +2] mw-api-ext/mw-web: Raise main replicas to 16 [deployment-charts] - https://gerrit.wikimedia.org/r/960591 (owner: JMeybohm)
[11:57:24] <claime>	 jynus: Do you want me to downtime it before rebooting its pair in codfw?
[11:57:54] <wikibugs>	 (Merged) jenkins-bot: mw-api-ext/mw-web: Raise main replicas to 16 [deployment-charts] - https://gerrit.wikimedia.org/r/960591 (owner: JMeybohm)
[11:58:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (JMeybohm)
[11:58:52] <jynus>	 claime: no worries on my side, if I know it is going to happen - just be quick to ack on splunk, so it doesn't p* everybody :-D
[11:59:03] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (JMeybohm)
[11:59:17] <claime>	 jynus: ack, sorry for the bother
[11:59:39] <jynus>	 no issues caused, alerts are there to happen!
[12:00:13] <claime>	 I'll do that after my lunch
[12:00:17] <claime>	 :)
[12:01:01] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: remove more cloudmetrics1003 references [puppet] - https://gerrit.wikimedia.org/r/960576 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[12:01:39] <jayme>	 jouncebot: nowandnext
[12:01:39] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[12:01:39] <jouncebot>	 In 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1300)
[12:01:55] <jynus>	 I am more interested in seeing downtimed the non-useful alerts, like those for new hosts that are WIP or long running maintenance, that only add noise to observability (while the docker-registry was a real issue)
[12:02:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:02:35] <wikibugs>	 (PS1) Hnowlan: mobileapps: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/960596
[12:03:07] <jynus>	 and sometimes it is useful to see things go down and up correctly!
[12:07:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:08:06] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:08:16] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[12:10:33] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: update o11y rolemap [puppet] - https://gerrit.wikimedia.org/r/960599
[12:11:21] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465) (owner: Muehlenhoff)
[12:12:29] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: update o11y rolemap [puppet] - https://gerrit.wikimedia.org/r/960599 (owner: Filippo Giunchedi)
[12:12:48] <wikibugs>	 (PS1) Jbond: puppet_compiler: roll back to 2.5.6 [puppet] - https://gerrit.wikimedia.org/r/960600 (https://phabricator.wikimedia.org/T346216)
[12:13:52] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: roll back to 2.5.6 [puppet] - https://gerrit.wikimedia.org/r/960600 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[12:14:25] <jbond>	 godog: happy for me to merge yours
[12:14:33] <godog>	 jbond: oops! yes thank you, I forgot
[12:14:39] <jbond>	 done
[12:14:44] <godog>	 cheers
[12:16:16] <logmsgbot>	 !log jayme@deploy2002 Started scap: (no justification provided)
[12:16:30] <wikibugs>	 (PS1) Majavah: icinga: Drop monitoring for *.wmcloud.org certificate [puppet] - https://gerrit.wikimedia.org/r/960601 (https://phabricator.wikimedia.org/T345983)
[12:16:32] <wikibugs>	 (PS1) Majavah: nagios_common: drop unused contact group [puppet] - https://gerrit.wikimedia.org/r/960602
[12:17:13] <jayme>	 !log bumping k8s deployment mw-web and mw-api-ext to 16 replicas each in both DCs
[12:17:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:53] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): deomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (jbond)
[12:18:59] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (jbond)
[12:19:54] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:20:00] <wikibugs>	 (PS1) Muehlenhoff: dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/960603
[12:20:24] <wikibugs>	 (CR) CI reject: [V: -1] dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/960603 (owner: Muehlenhoff)
[12:22:50] <wikibugs>	 ops-eqiad, DC-Ops: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (BTullis)
[12:22:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] hieradata: update ns0.openstack address [puppet] - https://gerrit.wikimedia.org/r/960570 (owner: Majavah)
[12:23:08] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: update ns0.openstack address [puppet] - https://gerrit.wikimedia.org/r/960570 (owner: Majavah)
[12:23:32] <wikibugs>	 (CR) JMeybohm: [C: +1] mobileapps: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/960596 (owner: Hnowlan)
[12:23:52] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (BTullis)
[12:24:08] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes1012.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:24:12] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (BTullis) p:Triage→Medium
[12:25:27] <wikibugs>	 (PS2) FNegri: Package for Debian Bookworm [debs/mcrouter] - https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762)
[12:25:29] <wikibugs>	 (PS4) FNegri: d/changelog: bump version [debs/mcrouter] - https://gerrit.wikimedia.org/r/959316 (owner: David Caro)
[12:26:25] <logmsgbot>	 !log jayme@deploy2002 Finished scap: (no justification provided) (duration: 10m 08s)
[12:26:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes1022.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:28:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:28:49] <wikibugs>	 (PS2) Muehlenhoff: dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/960603
[12:29:43] <wikibugs>	 (CR) JMeybohm: [C: +1] otel-coll: enable prometheus scraping [deployment-charts] - https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: Filippo Giunchedi)
[12:29:52] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Didn't test a build, but looks good to me in general" [debs/mcrouter] - https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: FNegri)
[12:30:07] <jinxer-wm>	 (ProbeDown) firing: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:31:07] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959179 (owner: Muehlenhoff)
[12:33:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:33:54] <wikibugs>	 (CR) Hashar: python-build: provide a python2 Bullseye image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[12:35:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:36:31] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/960603 (owner: Muehlenhoff)
[12:38:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/960162 (owner: Majavah)
[12:38:50] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] P:openstack::pdns::recusor: cleanup remains from old setup [puppet] - https://gerrit.wikimedia.org/r/960162 (owner: Majavah)
[12:41:37] <jinxer-wm>	 (ProbeDown) firing: (3) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:41:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:43:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:45:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[12:45:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[12:45:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es[1024-1025].eqiad.wmnet with reason: Maintenance
[12:45:51] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno)
[12:45:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1024-1025].eqiad.wmnet with reason: Maintenance
[12:46:37] <jinxer-wm>	 (ProbeDown) firing: (3) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:47:28] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:47:51] <wikibugs>	 (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm)
[12:50:35] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:52:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance
[12:52:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance
[12:52:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P52603 and previous config saved to /var/cache/conftool/dbconfig/20230925-125212-ladsgroup.json
[12:53:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:54:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:56:21] <kamila_>	 !log put codfw before eqiad in geoDNS defaults
[12:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:37] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:57:39] <wikibugs>	 (03CR) 10Muehlenhoff: python-build: provide a python2 Bullseye image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar)
[12:58:02] <urbanecm>	 jouncebot: nowandnext
[12:58:02] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 1 minute(s)
[12:58:02] <jouncebot>	 In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1300)
[12:58:10] <wikibugs>	 (03PS1) 10Aqu: Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688)
[12:58:13] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] AddImageFeedbackHandler: Add missing parameters [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959993 (https://phabricator.wikimedia.org/T346277) (owner: 10Urbanecm)
[12:58:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm)
[12:58:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10cmooney) Hmm yeah good point.  We can probably upgrade devices to a release with the fix in it before then.
[12:59:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/960601 (https://phabricator.wikimedia.org/T345983) (owner: 10Majavah)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1300)
[13:00:05] <jouncebot>	 sergi0, ihurbain, houseofm, and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <jinxer-wm>	 (ProbeDown) firing: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:00:13] <urbanecm>	 I can deploy today
[13:00:17] <urbanecm>	 hi all!
[13:00:20] <ihurbain>	 i'm around :)
[13:00:22] <sergi0>	 o/
[13:00:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] nagios_common: drop unused contact group [puppet] - 10https://gerrit.wikimedia.org/r/960602 (owner: 10Majavah)
[13:00:25] <urbanecm>	 i!
[13:00:26] <urbanecm>	 hi!
[13:00:32] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] icinga: Drop monitoring for *.wmcloud.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/960601 (https://phabricator.wikimedia.org/T345983) (owner: 10Majavah)
[13:00:39] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: enable AddLink backend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno)
[13:00:43] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] nagios_common: drop unused contact group [puppet] - 10https://gerrit.wikimedia.org/r/960602 (owner: 10Majavah)
[13:01:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno)
[13:01:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwmaint1002.eqiad.wmnet
[13:01:26] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink backend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno)
[13:01:28] <urbanecm>	 query houseofm
[13:01:31] <urbanecm>	 eh
[13:01:36] <HouseOfM>	 o/
[13:01:42] <urbanecm>	 hi HouseOfM!
[13:01:47] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:960074|GrowthExperiments: enable AddLink backend 14th round of wikis (T308139)]]
[13:02:00] <stashbot>	 T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139
[13:02:03] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[13:02:34] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/959950 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[13:03:24] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 59278
[13:03:38] <jayme>	 !log cordoned kubernetes10[27-56]
[13:03:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:13] <moritzm>	 !log installing openjdk-11 security updates on buster
[13:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:04:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P52604 and previous config saved to /var/cache/conftool/dbconfig/20230925-130444-ladsgroup.json
[13:05:07] <jinxer-wm>	 (ProbeDown) resolved: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:05:37] <jinxer-wm>	 (ProbeDown) firing: (2) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:05:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks okay" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar)
[13:05:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:08:04] <wikibugs>	 (03PS1) 10Majavah: P:openstack::galera: drop nrpe process check [puppet] - 10https://gerrit.wikimedia.org/r/960612 (https://phabricator.wikimedia.org/T345294)
[13:08:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm)
[13:08:32] <urbanecm>	 still preparing the k8s image for deployment...
[13:08:44] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:09:18] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwmaint1002.eqiad.wmnet
[13:10:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:10:37] <jinxer-wm>	 (ProbeDown) resolved: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-web:4450 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:10:44] <urbanecm>	 ehm... i got kubernetes2010.codfw.wmnet port 22: Connection timed out during a scap deployment
[13:10:49] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Noted, thanks, still a bit blurry to me but overall it makes sense!" [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[13:10:52] <urbanecm>	 what's happening?
[13:11:01] <_joe_>	 urbanecm: it's ok, that server is down
[13:11:19] <urbanecm>	 _joe_: shouldn't scap ignore it then?
[13:11:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:11:38] <_joe_>	 but in theory it should disappear from there yes unless I did something asinine with it :)
[13:11:44] <_joe_>	 urbanecm: did that make scap fail?
[13:11:47] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1010.eqiad.wmnet
[13:11:54] <urbanecm>	 _joe_: no, it just yells at me and proceeds.
[13:12:01] <_joe_>	 urbanecm: ok cool
[13:12:16] <jayme>	 _joe_: it just prints an error during prefetch
[13:12:37] <_joe_>	 jayme: yeah I want to have a way to exclude a node, as I said, I'll look into it
[13:13:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:13:52] <jayme>	 !log uncordoned kubernetes10[27-56]
[13:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:57] <_joe_>	 pdb_query: "Class[Profile::Kubernetes::Mediawiki_runner] and User[mwdeploy]{ensure=present}"
[13:13:59] <_joe_>	 heh
[13:14:11] <_joe_>	 that's not great
[13:14:15] <urbanecm>	 seems like no way to exclude a node to me
[13:14:20] <urbanecm>	 but i don't know much about puppet :))
[13:14:23] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and sgimeno: Backport for [[gerrit:960074|GrowthExperiments: enable AddLink backend 14th round of wikis (T308139)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:14:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:14:25] <wikibugs>	 (03PS1) 10Majavah: cloudlb: remove unused firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/960614
[13:14:29] <stashbot>	 T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139
[13:14:39] <urbanecm>	 sergi0: anyway, available at mwdebug now. but afaik nothing can be tested, right?
[13:14:40] <jayme>	 !log ran homer "lsw1-*eqiad*" commit - T346714
[13:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:46] <stashbot>	 T346714: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714
[13:14:58] <_joe_>	 urbanecm: we'll think of a solution; in the meantime proceed please
[13:15:05] <sergi0>	 urbanecm: right, nothing to test
[13:15:10] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and sgimeno: Continuing with sync
[13:15:15] <urbanecm>	 thanks _joe_, proceeding.
[13:15:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "♥" [puppet] - 10https://gerrit.wikimedia.org/r/960612 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[13:15:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:15:51] <wikibugs>	 (03Merged) 10jenkins-bot: AddImageFeedbackHandler: Add missing parameters [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959993 (https://phabricator.wikimedia.org/T346277) (owner: 10Urbanecm)
[13:15:54] <wikibugs>	 (03Merged) 10jenkins-bot: listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm)
[13:16:15] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:openstack::galera: drop nrpe process check [puppet] - 10https://gerrit.wikimedia.org/r/960612 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah)
[13:16:41] <wikibugs>	 (03PS2) 10Urbanecm: Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:16:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:17:02] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1010.eqiad.wmnet
[13:18:16] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:18:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:19:16] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[13:19:39] <urbanecm>	 ihurbain: i'll be proceeding with your patch soon.
[13:19:44] <ihurbain>	 ok :)
[13:19:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P52605 and previous config saved to /var/cache/conftool/dbconfig/20230925-131951-ladsgroup.json
[13:20:34] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:21:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] cloudlb: remove unused firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/960614 (owner: 10Majavah)
[13:21:24] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] cloudlb: remove unused firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/960614 (owner: 10Majavah)
[13:21:30] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=kubernetes,dc=eqiad
[13:21:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:22:01] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=kubernetes,dc=codfw
[13:22:11] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1009.eqiad.wmnet
[13:22:55] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[13:23:42] <wikibugs>	 (03CR) 10Hashar: "The image is to build Zuul dependencies (which requires python 2.7) which will allow to migrate the contint* servers from Buster to Bullse" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar)
[13:23:45] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:25:15] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:960074|GrowthExperiments: enable AddLink backend 14th round of wikis (T308139)]] (duration: 23m 28s)
[13:25:22] <stashbot>	 T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139
[13:25:24] <urbanecm>	 this is...very slow. 20+ minutes per patch.
[13:25:26] <urbanecm>	 sergi0: synced
[13:25:36] <sergi0>	 cool, ty!
[13:26:03] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:959987|listTaskCounts: Do not expect tasks key to be present (T347120)]], [[gerrit:959993|AddImageFeedbackHandler: Add missing parameters (T346277)]], [[gerrit:960552|Enable Parsoid support for Kartographer on enwikivoyage (T342871)]]
[13:26:07] <claime>	 urbanecm: Is it the prefetch to k8s nodes taking time?
[13:26:13] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:26:13] <stashbot>	 T346277: Addimage feedback API cannot be called successfully - https://phabricator.wikimedia.org/T346277
[13:26:14] <stashbot>	 T347120: PHP Notice: Undefined index: tasks - https://phabricator.wikimedia.org/T347120
[13:26:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:27:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343198)', diff saved to https://phabricator.wikimedia.org/P52606 and previous config saved to /var/cache/conftool/dbconfig/20230925-132711-arnaudb.json
[13:27:18] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[13:28:01] <hashar>	  !log Restarting CI Jenkins
[13:28:03] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1009.eqiad.wmnet
[13:28:04] <urbanecm>	 claime: these are the long steps, based on the transcript: `build-and-push-container-images (duration: 05m 24s)`, `docker pull on k8s nodes (duration: 02m 10s)`, `sync-check-canaries (duration: 01m 18s)`, `sync-prod-k8s (duration: 01m 54s)`, `php-fpm-restarts (duration: 02m 39s)`
[13:28:44] <wikibugs>	 (03PS1) 10Peter Fischer: add search update pipeline streams (update + fetch_error) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609)
[13:29:11] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:30:11] <claime>	 That's probably the redis reboot as well
[13:30:16] <claime>	 (the netbox alert)
[13:30:59] <claime>	 restarted the service, looks good
[13:32:01] <urbanecm>	 let's see how the second round of scap goes. but we won't have time for a third one if it takes 20 mins again.
[13:32:31] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: session-c107960.scope,session-c107961.scope,session-c107962.scope,session-c107963.scope,session-c107964.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:59] <claime>	 The build and push shouldn't be that long if the number of changed layers is low, but it is definitely a pain point we'll have to address
[13:33:35] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[13:34:12] <urbanecm>	 claime: i'm kind of worried what would happen once a patch needs a quick revert because it takes our site down / breaks editing / whatever. 
[13:34:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P52607 and previous config saved to /var/cache/conftool/dbconfig/20230925-133457-ladsgroup.json
[13:35:09] <claime>	 urbanecm: We always have the option of using a helm rollback to the former image for k8s I think
[13:35:17] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,name=kubernetes.*
[13:35:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet
[13:36:12] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,name=kubernetes.*
[13:36:13] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.codfw.wmnet
[13:36:21] <wikibugs>	 (03CR) 10DCausse: add search update pipeline streams (update + fetch_error) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer)
[13:38:33] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and ihurbain: Backport for [[gerrit:959987|listTaskCounts: Do not expect tasks key to be present (T347120)]], [[gerrit:959993|AddImageFeedbackHandler: Add missing parameters (T346277)]], [[gerrit:960552|Enable Parsoid support for Kartographer on enwikivoyage (T342871)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:38:43] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:38:45] <stashbot>	 T346277: Addimage feedback API cannot be called successfully - https://phabricator.wikimedia.org/T346277
[13:38:46] <stashbot>	 T347120: PHP Notice: Undefined index: tasks - https://phabricator.wikimedia.org/T347120
[13:38:47] <urbanecm>	 ihurbain: hi, can you test at mwdebug please?
[13:38:52] <ihurbain>	 yup, doing that
[13:38:54] <urbanecm>	 ty
[13:39:15] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:41:10] <HouseOfM>	 @urbanecm I'll move my patch to a later window
[13:41:27] <urbanecm>	 HouseOfM: yes, that seems like a reasonable decision. thanks. we won't have time for another scap sync, unfortunately.
[13:41:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: scap::dsh: temporarily exclude kubernetes2010 [puppet] - 10https://gerrit.wikimedia.org/r/960621 (https://phabricator.wikimedia.org/T347267)
[13:41:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:15] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet
[13:42:15] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[13:42:15] <_joe_>	 urbanecm: if you have time for a last patch
[13:42:18] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P52610 and previous config saved to /var/cache/conftool/dbconfig/20230925-134217-arnaudb.json
[13:42:27] <_joe_>	 I would ask you to wait a minute or two
[13:42:51] <urbanecm>	 _joe_: i'm mid-scap run. do you want me to abort it? or just complete as-is?
[13:42:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap::dsh: temporarily exclude kubernetes2010 [puppet] - 10https://gerrit.wikimedia.org/r/960621 (https://phabricator.wikimedia.org/T347267) (owner: 10Giuseppe Lavagetto)
[13:43:05] <_joe_>	 urbanecm: complete as-is
[13:43:11] <urbanecm>	 ack
[13:43:13] <wikibugs>	 (03PS1) 10David Caro: wmcs: disable pages from nagios/icinga [puppet] - 10https://gerrit.wikimedia.org/r/960622
[13:43:37] <ihurbain>	 urbanecm: we're good on mwdebug
[13:43:41] <urbanecm>	 great, proceeding.
[13:43:42] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and ihurbain: Continuing with sync
[13:44:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:48:48] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] wmcs: disable pages from nagios/icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960622 (owner: 10David Caro)
[13:49:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:49:36] <_joe_>	 urbanecm: ah I got why things were so slow - you're the unlucky one who synced things to the new k8s nodes :)
[13:49:47] <urbanecm>	 but why is it slow twice in a row?
[13:49:53] <_joe_>	 but also now I excluded 2010
[13:49:57] <_joe_>	 yeah that I'm not sure about
[13:50:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P52611 and previous config saved to /var/cache/conftool/dbconfig/20230925-135004-ladsgroup.json
[13:50:12] <urbanecm>	 heh :). thanks for excluding 2010 though.
[13:51:18] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet
[13:52:32] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (10kamila) 05Open→03Resolved
[13:52:35] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila)
[13:53:07] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila)
[13:53:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10Jhancock.wm) @cmooney I haven't received it yet. I checked with the dock to make sure it hasn't arrived and we weren't notified but no luck. Is there a tracking number for the package?
[13:53:47] <wikibugs>	 (03PS3) 10JMeybohm: Update chromium-render to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033)
[13:53:58] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) 05Open→03Resolved
[13:54:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update chromium-render to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[13:54:03] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila)
[13:54:34] <wikibugs>	 (03CR) 10FNegri: "I'm not clear if this patch will prevent us from being paged if a physical host goes down. Is alertmanager also sending a page, or should " [puppet] - 10https://gerrit.wikimedia.org/r/960622 (owner: 10David Caro)
[13:54:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (21) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:55:03] <taavi>	 _joe_: it's been slow every time I've deployed since the switchover, which makes me doubt that theory
[13:55:07] <jinxer-wm>	 (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:55:16] <wikibugs>	 (03PS2) 10JMeybohm: Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033)
[13:55:22] <urbanecm>	 like, 20 minutes per patch slow taavi?
[13:55:27] <taavi>	 yes
[13:55:37] <_joe_>	 taavi: the other option is that deploy2002 has a terrible disk
[13:55:49] <urbanecm>	 do we have a task about the post-switchover slowness? if not, i can file one.
[13:56:17] <wikibugs>	 (03CR) 10Elukey: Avoid pages for ores.discovery.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:56:21] <Lucas_WMDE>	 it was also that slow last time I deployed, see https://sal.toolforge.org/production?q=959304
[13:56:29] <wikibugs>	 (03PS2) 10Elukey: icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278)
[13:56:31] <wikibugs>	 (03PS2) 10Elukey: Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278)
[13:56:36] <_joe_>	 did you report this to release engineering?
[13:56:38] <wikibugs>	 (03CR) 10Elukey: icinga/nagios: remove check_ores* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:57:00] <_joe_>	 I wouldn't assume the problem is the switchover tbh
[13:57:10] <_joe_>	 but full logs would help us understand what got slower
[13:57:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P52612 and previous config saved to /var/cache/conftool/dbconfig/20230925-135724-arnaudb.json
[13:57:28] <taavi>	 https://wikimedia.slack.com/archives/C05H0JYT85V/p1695238823352749 seems like it might be relevant here?
[13:57:48] <urbanecm>	 and now i got `skipping missing values file matching "values-main.yaml"`. 
[13:58:14] <urbanecm>	 and `Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition`
[13:58:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet
[13:58:30] <urbanecm>	 and `13:56:07 Rolling back to prior state...` :-/
[13:59:23] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 59278
[13:59:51] <wikibugs>	 (03PS2) 10JMeybohm: Update mathoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953261 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli)
[14:00:04] <icinga-wm>	 PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:00:07] <jinxer-wm>	 (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:00:32] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet
[14:01:08] <wikibugs>	 10SRE, 10Traffic: Implement VTC tests for PURGE requests - https://phabricator.wikimedia.org/T347297 (10Fabfur)
[14:02:03] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add SLO definition for the ORES Legacy service (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey)
[14:02:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] alertmanager: create ml team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[14:02:23] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Lift Wing: add latency/availability SLO dashboards (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman)
[14:02:45] <wikibugs>	 (03PS6) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[14:02:47] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudservices1006: remove old listen-on address [puppet] - 10https://gerrit.wikimedia.org/r/960624 (https://phabricator.wikimedia.org/T346385)
[14:03:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1006: remove old listen-on address [puppet] - 10https://gerrit.wikimedia.org/r/960624 (https://phabricator.wikimedia.org/T346385) (owner: 10Andrew Bogott)
[14:03:44] <icinga-wm>	 RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms
[14:04:07] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:04:19] <_joe_>	 urbanecm: do you have a paste of one of your latest scaps?
[14:04:38] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:959987|listTaskCounts: Do not expect tasks key to be present (T347120)]], [[gerrit:959993|AddImageFeedbackHandler: Add missing parameters (T346277)]], [[gerrit:960552|Enable Parsoid support for Kartographer on enwikivoyage (T342871)]] (duration: 38m 35s)
[14:04:50] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[14:04:50] <stashbot>	 T346277: Addimage feedback API cannot be called successfully - https://phabricator.wikimedia.org/T346277
[14:04:51] <stashbot>	 T347120: PHP Notice: Undefined index: tasks - https://phabricator.wikimedia.org/T347120
[14:07:00] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet
[14:07:05] <jayme>	 urbanecm: does it say which environment?
[14:07:37] <wikibugs>	 (03PS1) 10JMeybohm: Update machinetranslation to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/960625 (https://phabricator.wikimedia.org/T300033)
[14:08:44] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:02] <icinga-wm>	 PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:10:13] <_joe_>	 we're still running the old version of the code in k8s btw
[14:10:25] <_joe_>	 jayme: you should run a scap --k8s-only tbh
[14:12:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343198)', diff saved to https://phabricator.wikimedia.org/P52613 and previous config saved to /var/cache/conftool/dbconfig/20230925-141230-arnaudb.json
[14:12:33] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[14:12:38] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[14:12:42] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10Jclark-ctr) @taavi  Relocated to rack E 4.  updated netbox with location.   switch port is #9
[14:12:46] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[14:12:49] <wikibugs>	 (03CR) 10Muehlenhoff: admin: Create analytics-wmde system user and airflow admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[14:12:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T343198)', diff saved to https://phabricator.wikimedia.org/P52614 and previous config saved to /var/cache/conftool/dbconfig/20230925-141252-arnaudb.json
[14:13:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance
[14:13:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance
[14:13:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P52615 and previous config saved to /var/cache/conftool/dbconfig/20230925-141313-ladsgroup.json
[14:13:19] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10Jclark-ctr)
[14:13:30] <icinga-wm>	 RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms
[14:14:08] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:07] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Wikifunctions: Update evaluator image to 2023-09-19-183305 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959037 (owner: 10Jforrester)
[14:17:12] <wikibugs>	 (03Merged) 10jenkins-bot: Wikifunctions: Update evaluator image to 2023-09-19-183305 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959037 (owner: 10Jforrester)
[14:17:16] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/960596 (owner: 10Hnowlan)
[14:18:02] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/960596 (owner: 10Hnowlan)
[14:18:38] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:18:41] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:18:44] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:19:07] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:19:12] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:19:14] <wikibugs>	 (03PS1) 10JMeybohm: Remove all quota from mw namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626
[14:19:15] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:20:02] <_joe_>	 taavi: so it seems we're indeed rebuilding from scratch for every release, not sure why.
[14:20:19] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:20:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10Jclark-ctr) @BTullis  i am on site today.  otherwise we can do it tomorrow
[14:20:50] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:21:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove all quota from mw namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm)
[14:21:12] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:21:17] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10Jclark-ctr) a:05Jclark-ctr→03taavi
[14:21:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Remove all quota from mw namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm)
[14:22:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:22:11] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:22:53] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove all quota from mw namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm)
[14:22:59] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:23:48] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove all quota from mw namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm)
[14:23:52] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:24] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet
[14:24:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:24:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:24:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:25:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:25:11] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:25:21] <wikibugs>	 (03PS2) 10Jforrester: Re-apply "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998)
[14:25:54] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43532/console" [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[14:26:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:27:24] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Re-apply "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester)
[14:28:01] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Avoid pages for ores.discovery.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[14:28:38] <wikibugs>	 (03Merged) 10jenkins-bot: Re-apply "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester)
[14:28:41] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:29:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:29:45] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.netbox
[14:30:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:31:03] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:31:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:31:28] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:31:50] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - herron@cumin1001"
[14:31:50] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[14:32:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:32:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:33:08] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:33:34] <logmsgbot>	 !log jayme@deploy2002 Started scap: (no justification provided)
[14:34:10] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:34:17] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:34:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse)
[14:35:09] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P52618 and previous config saved to /var/cache/conftool/dbconfig/20230925-143523-ladsgroup.json
[14:36:12] <wikibugs>	 (03PS1) 10Jforrester: Revert "Re-apply "Fix wikifunctions orchestrator not using the service mesh"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959997
[14:36:16] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Revert "Re-apply "Fix wikifunctions orchestrator not using the service mesh"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959997 (owner: 10Jforrester)
[14:36:44] <logmsgbot>	 !log jayme@deploy2002 Finished scap: (no justification provided) (duration: 03m 09s)
[14:37:00] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Re-apply "Fix wikifunctions orchestrator not using the service mesh"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959997 (owner: 10Jforrester)
[14:37:06] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10CDanis) 05Stalled→03In progress a:05NHillard-WMF→03CDanis
[14:37:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! very nice" [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[14:37:57] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:38:26] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[14:38:47] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:38:55] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:39:38] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[14:39:43] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:39:57] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:40:08] <wikibugs>	 (03PS3) 10Elukey: Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278)
[14:40:16] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - 10https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[14:40:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:40:48] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[14:41:18] <wikibugs>	 (03PS1) 10Jclark-ctr: new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660)
[14:41:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr)
[14:42:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add jaeger query/collector alerts [alerts] - 10https://gerrit.wikimedia.org/r/959950 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[14:42:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[14:42:48] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] APIGW: add entry for multilingual readability LW isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman)
[14:43:36] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - herron@cumin1001"
[14:43:36] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:43:37] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet
[14:43:40] <wikibugs>	 (03PS38) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[14:43:42] <wikibugs>	 (03PS1) 10AOkoth: gitlab: change service_name on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/960632 (https://phabricator.wikimedia.org/T345590)
[14:44:35] <wikibugs>	 (03PS2) 10Herron: remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937)
[14:44:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet
[14:45:14] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T347257 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm caused by issue in T347267. will resolve there.
[14:45:33] <wikibugs>	 (03PS2) 10Jclark-ctr: new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660)
[14:45:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:45:49] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[14:45:56] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[14:45:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr)
[14:46:07] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[14:46:12] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[14:47:44] <wikibugs>	 (03PS3) 10Jclark-ctr: new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660)
[14:47:57] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "One nit, otherwise lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman)
[14:48:03] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10Jhancock.wm) server is not getting to POST. starting troubleshooting.
[14:48:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr)
[14:48:11] <wikibugs>	 (03CR) 10Herron: [C: 03+2] remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[14:48:30] <wikibugs>	 (03PS1) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590)
[14:48:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10colewhite) >>! In T346796#9192790, @AKhatun_WMF wrote: > I am getting this error when I kinit > `kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while...
[14:48:53] <wikibugs>	 (03PS2) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590)
[14:49:04] <wikibugs>	 (03PS2) 10AOkoth: gitlab: change service_name on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/960632 (https://phabricator.wikimedia.org/T345590)
[14:49:09] <wikibugs>	 (03PS4) 10Jclark-ctr: new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660)
[14:49:16] <wikibugs>	 (03PS2) 10Klausman: APIGW: add entry for multilingual readability LW isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182)
[14:49:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr)
[14:49:56] <wikibugs>	 (03CR) 10Klausman: APIGW: add entry for multilingual readability LW isvc (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman)
[14:50:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P52619 and previous config saved to /var/cache/conftool/dbconfig/20230925-145029-ladsgroup.json
[14:52:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth)
[14:53:25] <wikibugs>	 (03PS5) 10Jclark-ctr: new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660)
[14:53:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:53:28] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet
[14:54:06] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] new  node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr)
[14:54:32] <wikibugs>	 (03CR) 10Herron: [C: 03+2] dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron)
[14:54:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:41] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] gitlab: delay restore timer 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/959683 (owner: 10Jelto)
[14:56:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:06] <moritzm>	 !log installing python3.7 security updates
[14:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:30] <wikibugs>	 (PS6) Andrea Denisse: prometheus: Prevent Prometheus from scraping certain statsd-exporters [puppet] - https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656)
[14:58:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:59:37] <wikibugs>	 (CR) Elukey: new  node T342660 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: Jclark-ctr)
[15:00:15] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:02:28] <wikibugs>	 (CR) David Caro: wmcs: disable pages from nagios/icinga (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960622 (owner: David Caro)
[15:04:02] <wikibugs>	 (CR) JHathaway: [C: +2] puppetserver: Serve the full cert chain via jetty [puppet] - https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[15:04:14] <wikibugs>	 (CR) JHathaway: [C: +2] "thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[15:04:39] <wikibugs>	 (CR) JHathaway: [C: +2] "thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/959241 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[15:04:55] <wikibugs>	 (CR) JHathaway: [C: +2] "thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/959234 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[15:05:12] <wikibugs>	 (CR) JHathaway: "thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[15:05:14] <wikibugs>	 (CR) JHathaway: [C: +2] puppetdb prometheus exporter: in a container listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[15:05:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P52620 and previous config saved to /var/cache/conftool/dbconfig/20230925-150536-ladsgroup.json
[15:06:04] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:06:08] <wikibugs>	 (CR) JHathaway: [C: +2] "thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[15:06:16] <wikibugs>	 (CR) AikoChou: [C: +1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: Klausman)
[15:06:32] <wikibugs>	 (CR) JHathaway: [C: +2] "thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/959224 (https://phabricator.wikimedia.org/T344868) (owner: JHathaway)
[15:06:49] <wikibugs>	 (CR) JHathaway: [C: +2] "thanks for the reviews" [puppet] - https://gerrit.wikimedia.org/r/959227 (owner: JHathaway)
[15:07:00] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 2.464 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:08:10] <wikibugs>	 (PS2) Peter Fischer: add search update pipeline streams (update + fetch_error) [mediawiki-config] - https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609)
[15:08:57] <wikibugs>	 (CR) Peter Fischer: add search update pipeline streams (update + fetch_error) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: Peter Fischer)
[15:10:16] <wikibugs>	 (PS1) Muehlenhoff: standard_packages: Remove Python 3.7 packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/960634
[15:10:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:11:58] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[15:12:34] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[15:13:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) @JMeybohm looks like the system board has died. Server powers on, but even with minimum hardware configuration the server will not actually boot up. Idrac is also inaccessible.   This...
[15:14:09] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) a:Jhancock.wm
[15:14:32] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[15:15:13] <wikibugs>	 (CR) Filippo Giunchedi: "The patch LGTM, I thought about it a little bit and it'll work as-is, however:" [puppet] - https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: Andrea Denisse)
[15:15:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:16:12] <wikibugs>	 (PS1) FNegri: Add more details to Readme [debs/mcrouter] - https://gerrit.wikimedia.org/r/960637
[15:16:54] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1007 - taavi@cumin1001"
[15:17:07] <wikibugs>	 (CR) JHathaway: puppetdb: preseed to avoid creating database users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[15:17:43] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1007 - taavi@cumin1001"
[15:17:43] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:19:00] <herron>	 !log alert[12]001 -- apt remove docker.io T344937
[15:19:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:07] <stashbot>	 T344937: Decom dispatch infrastructure - https://phabricator.wikimedia.org/T344937
[15:20:22] <wikibugs>	 (PS1) Andrea Denisse: superset: Disable Prometheus scraping for superset metrics [puppet] - https://gerrit.wikimedia.org/r/960638 (https://phabricator.wikimedia.org/T346656)
[15:20:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P52621 and previous config saved to /var/cache/conftool/dbconfig/20230925-152043-ladsgroup.json
[15:20:44] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[15:21:28] <herron>	 !log alert[12]001 -- rm /etc/apache2/sites-available/50-dispatch-wikimedia-org.conf && apachectl graceful  T344937
[15:21:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:36] <wikibugs>	 (CR) FNegri: "For some reason I don't have +2 rights on this repo. I have built the package on mcrouter.packaging.eqiad1.wikimedia.cloud, can you please" [debs/mcrouter] - https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: FNegri)
[15:21:54] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1007
[15:22:34] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1007
[15:23:05] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new records for cloudcontrol1007 - cmooney@cumin1001"
[15:23:54] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new records for cloudcontrol1007 - cmooney@cumin1001"
[15:23:54] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:23:54] <wikibugs>	 (PS1) Muehlenhoff: puppetdb: Remove obsolete Hiera settings [puppet] - https://gerrit.wikimedia.org/r/960641
[15:24:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove traceback-roots [puppet] - https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465) (owner: Muehlenhoff)
[15:24:30] <wikibugs>	 (PS2) Muehlenhoff: Remove traceback-roots [puppet] - https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465)
[15:26:14] <wikibugs>	 (PS3) Cwhite: prometheus: add option to configure probe-specific params [puppet] - https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893)
[15:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:26:42] <wikibugs>	 (PS1) Majavah: site: re-assign role for cloudcontrol1007 [puppet] - https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892)
[15:27:23] <wikibugs>	 (PS4) Cwhite: prometheus: add option to configure probe-specific params [puppet] - https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893)
[15:27:36] <wikibugs>	 (PS2) Majavah: site: re-assign role for cloudcontrol1007 [puppet] - https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892)
[15:28:39] <wikibugs>	 (PS3) Majavah: site: re-assign role for cloudcontrol1007 [puppet] - https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892)
[15:28:45] <wikibugs>	 (CR) Arturo Borrero Gonzalez: site: re-assign role for cloudcontrol1007 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) (owner: Majavah)
[15:28:57] <wikibugs>	 (CR) Majavah: site: re-assign role for cloudcontrol1007 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) (owner: Majavah)
[15:29:22] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] site: re-assign role for cloudcontrol1007 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) (owner: Majavah)
[15:29:33] <wikibugs>	 SRE, ops-eqiad, Patch-For-Review, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (taavi)
[15:29:56] <wikibugs>	 (CR) Elukey: [C: +1] standard_packages: Remove Python 3.7 packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/960634 (owner: Muehlenhoff)
[15:30:05] <jouncebot>	 jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1530).
[15:30:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] standard_packages: Remove Python 3.7 packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/960634 (owner: Muehlenhoff)
[15:30:36] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: add option to configure probe-specific params [puppet] - https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: Cwhite)
[15:30:51] <wikibugs>	 (CR) Cwhite: [C: +2] prometheus: add option to configure probe-specific params (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: Cwhite)
[15:31:30] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm) Thanks! We did not plan to decom immediately, so it would really help us if you could replace the board and we could run the server for a bit longer.
[15:33:12] <wikibugs>	 SRE, Traffic, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BCornwall) Resolved→In progress
[15:33:18] <wikibugs>	 (PS1) Cwhite: wmflib: fix typo in probe type [puppet] - https://gerrit.wikimedia.org/r/959985 (https://phabricator.wikimedia.org/T346893)
[15:33:24] <wikibugs>	 SRE, Traffic, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BCornwall) @Vgutierrez Thanks for your patch fixing thread_pool_max; IIRC @bblack had advised the flat 12000 max threads due to the arbitrary nature of the processorcount. Is this patch to...
[15:33:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (VRiley-WMF) wdqs1020 - E 2. U 18 CableID 2303045000257 Port 38 wdqs1021 - F 2. U 42. CableID 2303045000256 Port 20 wdqs1022 - D 2. U 13. CableID 230304500202 Port 25 wdqs1023...
[15:33:53] <wikibugs>	 (Abandoned) Cwhite: wmflib: fix typo in probe type [puppet] - https://gerrit.wikimedia.org/r/959985 (https://phabricator.wikimedia.org/T346893) (owner: Cwhite)
[15:34:21] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, see my comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/958807 about deploying that patch, once that's done we can me" [puppet] - https://gerrit.wikimedia.org/r/960638 (https://phabricator.wikimedia.org/T346656) (owner: Andrea Denisse)
[15:37:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance
[15:37:33] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[15:37:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance
[15:39:02] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:39:54] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:40:00] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:21] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[15:42:18] <wikibugs>	 (Merged) jenkins-bot: ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[15:43:19] <taavi>	 Amir1: elukey: hmm, https://petscan.wmflabs.org/ seems to expect that ores can return something in a javascript callback format instead of being JSON. is that supposed to be supported?
[15:43:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance
[15:44:14] <wikibugs>	 (PS5) C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey)
[15:44:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance
[15:45:08] <elukey>	 taavi: never heard about it, not even from logs.. ores legacy definitely doesn't support a js callback, didn't even know that ores supported that. Do you have a moment to open a task with the query that is being made?
[15:47:56] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Package for Debian Bookworm [debs/mcrouter] - https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: FNegri)
[15:48:44] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:49:30] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:49:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance
[15:50:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance
[15:50:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:52:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:54:05] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:55:04] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:55:06] <wikibugs>	 (PS1) CDanis: nathillard analytics-privatedata-users access [puppet] - https://gerrit.wikimedia.org/r/960647 (https://phabricator.wikimedia.org/T342588)
[15:55:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:55:59] <wikibugs>	 (CR) CDanis: [C: +2] nathillard analytics-privatedata-users access [puppet] - https://gerrit.wikimedia.org/r/960647 (https://phabricator.wikimedia.org/T342588) (owner: CDanis)
[15:56:40] <wikibugs>	 SRE, Traffic, Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (BBlack) To clarify and expand on my position about this thread count parameter (which is really just a side-issue related to this ticket, which is fundamentally complete):  1. Varnish's thr...
[15:56:46] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:57:01] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:57:32] <wikibugs>	 (PS3) AOkoth: gitlab: swap replica records [dns] - https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590)
[15:57:46] <wikibugs>	 (PS4) AOkoth: gitlab: swap replica records [dns] - https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590)
[15:58:32] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (CDanis) In progress→Resolved Hi Issac, sorry this slipped through SRE's process as well -- this should have been taken care of last week....
[15:58:57] <claime>	  /26
[16:00:32] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:01:03] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:01:09] <wikibugs>	 (CR) CI reject: [V: -1] gitlab: swap replica records [dns] - https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: AOkoth)
[16:01:28] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[16:03:36] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:04:16] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:04:57] <wikibugs>	 (CR) Peter Fischer: "Thank you for adapting the chart! Just noticed a config-naming-issue in the consumer (fetch failure -> fetch error)" [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (owner: Ebernhardson)
[16:06:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:06:55] <icinga-wm>	 PROBLEM - Host db2109 #page is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:45] <sukhe>	 hmm, I will depool
[16:07:51] <bblack>	 that's non-paging I guess?
[16:07:55] <sukhe>	 paging
[16:08:26] <bblack>	 well, I'm basing that on the lack of "# page" that's usually present (spacing that out because I think it triggers)
[16:08:47] <bblack>	 and that I got no page, even though I'm in business hours.  Even when I open splunk, nothing.
[16:08:47] <claime>	 It's between the hostname and "is DOWN"
[16:08:49] <sobanski>	 It's there
[16:08:57] <bblack>	 oh it is there, so the rest of my questions remain
[16:09:05] <logmsgbot>	 !log sukhe@cumin2002 dbctl commit (dc=all): 'Depool db2109', diff saved to https://phabricator.wikimedia.org/P52622 and previous config saved to /var/cache/conftool/dbconfig/20230925-160904-sukhe.json
[16:09:17] <claime>	 It is not in sirenbot's incidents though
[16:09:24] <sukhe>	 depooled
[16:09:33] <bblack>	 no active, no triggered, no acked, in the splunk UI on my phone
[16:09:55] <rzl>	 you wouldn't have gotten paged because you aren't on call, but it does also seem like the page never got to VO in the first place
[16:10:18] <rzl>	 oh, no I do see it under triggered
[16:10:33] <bblack>	 I have one showing there now, too
[16:10:33] <rzl>	 but a few minutes delayed getting there, which isn't good
[16:10:37] <bblack>	 but I didn't until just now :)
[16:10:43] <cdanis>	 I just got the push notification
[16:10:43] <sukhe>	 weird!
[16:10:45] <cdanis>	 and I am oncall
[16:10:48] <cdanis>	 thanks sukhe 
[16:11:14] <arnoldokoth>	 Just got paged too.
[16:11:22] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[16:11:26] <bblack>	 so it's just some delay issue
[16:11:38] <bblack>	 at least a few minutes
[16:11:41] <cdanis>	 icinga --> victorops uses SMTP, iirc
[16:11:57] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[16:12:00] <jynus>	 there are still mw errors but it is the dumping process
[16:12:06] <jynus>	 not end-user errors
[16:12:31] <jynus>	 or some other process in mwmaint2002
[16:14:46] <wikibugs>	 (Abandoned) Ilias Sarantopoulos: api-gateway: change liftwing hosts [deployment-charts] - https://gerrit.wikimedia.org/r/940945 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[16:15:32] <jinxer-wm>	 (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:16:29] <volans>	 this is a performance alert AFAICT, why is it alerting in here?
[16:17:25] <jynus>	 It is "MWScript.php migrateLinksTable --wiki=ruwikinews --table pagelinks --batch-size 10000 --sleep 0.1"
[16:17:56] <jynus>	 hopefully someone can search the process on mwmaint2002 and kill it
[16:18:15] <jynus>	 so it doesn't send more errors to the logs
[16:18:38] <taavi>	 elukey: https://phabricator.wikimedia.org/T347317
[16:18:55] <marostegui>	 am I needed?
[16:19:02] <jynus>	 no, it is a mw job
[16:19:17] <jynus>	 that hasn't updated after the depool
[16:19:38] <cdanis>	 db2109 looks to have been just ... powered off??
[16:19:46] <cdanis>	 there's nothing in the SEL
[16:19:49] <cdanis>	 and powerstatus is OFF
[16:19:52] <marostegui>	 I'll create a task for it
[16:19:59] <marostegui>	 cdanis: maybe a loose cable?
[16:20:07] <cdanis>	 marostegui: two loose cables?
[16:20:14] <cdanis>	 I thought we had redundant PSUs
[16:20:14] <jynus>	 marostegui: I mean that it could be handled tomorrow, it was not an emergency
[16:20:25] <jynus>	 after the depool
[16:20:59] <marostegui>	 thanks jynus 
[16:21:03] <marostegui>	 I just created the task 
[16:21:20] <marostegui>	 https://phabricator.wikimedia.org/T347318
[16:22:59] <wikibugs>	 ops-codfw, DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (CDanis) Nothing recent in the SAL.  racadm serveraction powerstatus reports OFF.  I guess someone asked the host to shut down via the management interface?
[16:23:02] <cdanis>	 thanks marostegui 
[16:27:34] <icinga-wm>	 RECOVERY - Host kubernetes2010 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[16:28:46] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:32:30] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:34:26] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:34:30] <_joe_>	 uh
[16:34:36] <_joe_>	 is that k8s2010?
[16:35:08] <jayme>	 seems like it, there is a recovery at least
[16:41:34] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:42:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) got it replaced. updated the asset tag, idrac IP, bios/idrac firmware, and adjusted some bios settings. the idrac and network addresses are pinging, and there are no alerts that I can...
[16:47:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:48:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:51:53] <jayme>	 !log uncordon kubernetes2010.codfw.wmnet - T347267
[16:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:00] <stashbot>	 T347267: kubernetes2010 down - https://phabricator.wikimedia.org/T347267
[16:53:03] <wikibugs>	 (PS1) JMeybohm: Revert "scap::dsh: temporarily exclude kubernetes2010" [puppet] - https://gerrit.wikimedia.org/r/960003
[16:53:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:18] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm) Nice, thanks for handling this so quickly! Nothing more to do from your end
[16:54:06] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) Open→Resolved
[16:54:19] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes2010.codfw.wmnet
[16:54:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:55:21] <wikibugs>	 (CR) JMeybohm: [C: +2] Revert "scap::dsh: temporarily exclude kubernetes2010" [puppet] - https://gerrit.wikimedia.org/r/960003 (owner: JMeybohm)
[16:55:44] <wikibugs>	 (PS2) JMeybohm: Revert "scap::dsh: temporarily exclude kubernetes2010" [puppet] - https://gerrit.wikimedia.org/r/960003 (https://phabricator.wikimedia.org/T347267)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1700)
[17:00:05] <jouncebot>	 ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1700).
[17:04:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:05:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:07:26] <wikibugs>	 (PS5) AOkoth: gitlab: swap replica records [dns] - https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590)
[17:07:36] <wikibugs>	 (PS6) AOkoth: gitlab: swap replica records [dns] - https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590)
[17:12:30] <wikibugs>	 (CR) Abijeet Patro: [V: +2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/960582 (owner: L10n-bot)
[17:16:27] <wikibugs>	 SRE, ops-codfw, DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (Jhancock.wm) a:Jhancock.wm    2023-09-25 16:05:43  SYS1001  System is turning off.    2023-09-25 16:05:43  SYS1003  System CPU Resetting.    2023-08-22 02:14:32  SYS1003  System CPU Resetting.  There's couldn't find...
[17:24:07] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[17:39:04] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:42:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (CDanis) a:Eevans→darthmon_wmde
[17:43:12] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:24] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[17:44:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:46:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:47:51] <wikibugs>	 (CR) Subramanya Sastry: "This feels stalled for a while now ... anything needed on my end to move this forward? We are only using local dbs and none of the product" [puppet] - https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: Ladsgroup)
[17:49:50] <wikibugs>	 (PS1) Ebernhardson: k8s config: Provide zookeeper hostnames [puppet] - https://gerrit.wikimedia.org/r/960662
[17:50:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:50:15] <wikibugs>	 (CR) CI reject: [V: -1] k8s config: Provide zookeeper hostnames [puppet] - https://gerrit.wikimedia.org/r/960662 (owner: Ebernhardson)
[17:51:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:53:59] <wikibugs>	 (PS2) Ebernhardson: k8s config: Provide zookeeper hostnames [puppet] - https://gerrit.wikimedia.org/r/960662
[17:54:23] <wikibugs>	 (CR) CI reject: [V: -1] k8s config: Provide zookeeper hostnames [puppet] - https://gerrit.wikimedia.org/r/960662 (owner: Ebernhardson)
[17:58:27] <wikibugs>	 (PS5) Ebernhardson: Pull some flink config down into the chart [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315)
[18:09:05] <wikibugs>	 (PS3) Ebernhardson: k8s config: Provide zookeeper hostnames [puppet] - https://gerrit.wikimedia.org/r/960662
[18:11:24] <wikibugs>	 (PS5) Bking: cloudelastic: new partman recipe [puppet] - https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463)
[18:17:11] <wikibugs>	 (PS1) Bking: wdqs: re-enable LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/960664 (https://phabricator.wikimedia.org/T347284)
[18:23:44] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:27:44] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[18:31:02] <wikibugs>	 (CR) Ebernhardson: Pull some flink config down into the chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: Ebernhardson)
[18:33:09] <wikibugs>	 (Abandoned) Bking: wdqs: re-enable LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/960664 (https://phabricator.wikimedia.org/T347284) (owner: Bking)
[18:37:49] <wikibugs>	 (PS7) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[18:38:37] <wikibugs>	 (PS1) Bking: trafficserver: use wdqs1015 as LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284)
[18:39:25] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) (owner: Bking)
[18:41:50] <wikibugs>	 (PS2) Bking: trafficserver: use wdqs1015 as LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284)
[18:44:57] <wikibugs>	 (PS1) Marostegui: db2109.: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/960688 (https://phabricator.wikimedia.org/T347318)
[18:45:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed
[18:45:40] <wikibugs>	 (CR) BCornwall: [C: +2] package_builder: add piuparts package [puppet] - https://gerrit.wikimedia.org/r/956968 (owner: BCornwall)
[18:45:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed
[18:46:01] <wikibugs>	 SRE, ops-codfw, DBA, Patch-For-Review: db2109 crashed - https://phabricator.wikimedia.org/T347318 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=048e37aa-4014-4b71-85fd-37c023deeb00) set by marostegui@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Ho...
[18:46:07] <wikibugs>	 (CR) Marostegui: [C: +2] db2109.: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/960688 (https://phabricator.wikimedia.org/T347318) (owner: Marostegui)
[18:47:26] <wikibugs>	 ops-codfw, DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (Marostegui) Downtimed it for a week
[18:48:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:50:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:52:07] <wikibugs>	 (PS8) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[18:53:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:55:07] <wikibugs>	 (PS4) Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - https://gerrit.wikimedia.org/r/960662
[18:55:10] <wikibugs>	 (PS3) Ebernhardson: flink-app: Provide kafka hosts as properties file [deployment-charts] - https://gerrit.wikimedia.org/r/959066
[18:55:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:57:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:59:02] <wikibugs>	 (PS9) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[18:59:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:04:57] <wikibugs>	 (CR) Ryan Kemper: [C: +1] trafficserver: use wdqs1015 as LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) (owner: Bking)
[19:05:36] <wikibugs>	 (CR) Ryan Kemper: [C: +2] trafficserver: use wdqs1015 as LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) (owner: Bking)
[19:07:02] <wikibugs>	 (CR) AOkoth: [C: +1] gitlab: delay restore timer 30 minutes [puppet] - https://gerrit.wikimedia.org/r/959683 (owner: Jelto)
[19:07:24] <wikibugs>	 (CR) AOkoth: [C: +1] gitlab: remove deprecated grafana feature [puppet] - https://gerrit.wikimedia.org/r/959689 (owner: Jelto)
[19:07:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:08:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:09:48] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:10:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:11:41] <wikibugs>	 (PS10) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:13:26] <wikibugs>	 SRE-OnFire, Data-Platform-SRE, Discovery-Search, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (bking)
[19:13:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:13:37] <wikibugs>	 (PS11) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:14:24] <wikibugs>	 (CR) Majavah: "This is causing Puppet to fail on some Cloud VPS hosts:" [puppet] - https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[19:14:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:16:22] <wikibugs>	 (PS12) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:16:39] <wikibugs>	 (PS1) RLazarus: httpbb: Switch to a different entity for testwikidata [puppet] - https://gerrit.wikimedia.org/r/960693
[19:19:29] <wikibugs>	 (PS13) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:20:50] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[19:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:29:49] <wikibugs>	 (PS14) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:30:14] <wikibugs>	 (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott)
[19:32:42] <wikibugs>	 (PS15) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:35:25] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] mtail: Record bad requests for varnish SLI metrics [puppet] - https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[19:37:15] <wikibugs>	 (PS16) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:39:43] <wikibugs>	 (PS17) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:42:23] <wikibugs>	 (PS18) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[19:50:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:52:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:55:14] <wikibugs>	 (CR) AOkoth: [C: +2] wikistats: drop some updates [puppet] - https://gerrit.wikimedia.org/r/956813 (owner: RhinosF1)
[19:55:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:56:16] <wikibugs>	 (PS3) JHathaway: puppetserver: add comment on avoiding perma-diff for /var/lib/puppet/ssl [puppet] - https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970)
[19:57:43] <wikibugs>	 (CR) JHathaway: [C: +2] "thanks for the review" [puppet] - https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[19:59:24] <wikibugs>	 (PS3) DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T2000).
[20:00:06] <jouncebot>	 danisztls and houseofm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:00:18] <danisztls>	 o/
[20:01:03] <cjming>	 hi i can deploy
[20:02:01] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951) (owner: DDesouza)
[20:03:40] <wikibugs>	 (Merged) jenkins-bot: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951) (owner: DDesouza)
[20:03:57] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:959826|Deploy Reader Demographics 2 pilot survey (T345951)]]
[20:04:05] <stashbot>	 T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951
[20:06:46] <Jdlrobson>	 also here, sorry, put it in the wrong deploy window it seems?
[20:07:15] <Jdlrobson>	 ^ cjming i've added now
[20:07:30] <danisztls>	 cjming: this change will be difficult to test as it only increases coverage
[20:07:30] <cjming>	 hi Jdlrobson :) sounds good
[20:08:16] <cjming>	 danisztls: should i go ahead and sync? or do you want to try to test?
[20:09:57] <danisztls>	 cjming: yep, go ahead
[20:12:50] <Superpes>	 Hi cjming I added a simple config patch too (if you have time after these deployments) :)
[20:12:55] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] Re-enable Extension:ParserMigration on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey)
[20:13:20] <cjming>	 Superpes: sure np
[20:14:07] <Superpes>	 Thanks ;)
[20:15:54] <logmsgbot>	 !log cjming@deploy2002 cjming and dani: Backport for [[gerrit:959826|Deploy Reader Demographics 2 pilot survey (T345951)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:16:01] <stashbot>	 T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951
[20:16:03] <logmsgbot>	 !log cjming@deploy2002 cjming and dani: Continuing with sync
[20:19:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:19:54] <cjming>	 houseofm: are you around for your patch?
[20:21:44] <danisztls>	 cjming: thanks!
[20:22:45] <cjming>	 danisztls: :) should be live shortly
[20:24:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:25:15] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:959826|Deploy Reader Demographics 2 pilot survey (T345951)]] (duration: 21m 18s)
[20:25:29] <stashbot>	 T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951
[20:25:41] <cjming>	 Jdlrobson: i'll do yours next
[20:25:46] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[20:25:48] <Jdlrobson>	 cool!
[20:26:53] <wikibugs>	 (CR) Clare Ming: [C: +2] Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: Jdlrobson)
[20:27:14] <wikibugs>	 (PS2) JHathaway: prometheus-postgres-exporter: install configs before service [puppet] - https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842)
[20:27:22] <wikibugs>	 (PS4) Clare Ming: Fix white background for Wikibooks wordmarks [mediawiki-config] - https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: Pikne)
[20:28:01] <wikibugs>	 (CR) JHathaway: prometheus-postgres-exporter: install configs before service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[20:28:13] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[20:30:40] <wikibugs>	 (PS3) Clare Ming: Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: Jdlrobson)
[20:32:45] <wikibugs>	 (CR) Clare Ming: [C: +2] Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: Jdlrobson)
[20:33:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:34:19] <cjming>	 Jdlrobson: i'm trying to manually +2 your patches (rebase in between) so i can scap backport them together
[20:34:59] <wikibugs>	 (CR) Clare Ming: [C: +2] Fix white background for Wikibooks wordmarks [mediawiki-config] - https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: Pikne)
[20:35:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:35:43] <wikibugs>	 (Merged) jenkins-bot: Fix white background for Wikibooks wordmarks [mediawiki-config] - https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: Pikne)
[20:35:46] <wikibugs>	 (Merged) jenkins-bot: Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: Jdlrobson)
[20:36:33] <wikibugs>	 (PS7) Clare Ming: Icons for special projects [mediawiki-config] - https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) (owner: Jdlrobson)
[20:37:46] <wikibugs>	 (CR) Clare Ming: [C: +2] Icons for special projects [mediawiki-config] - https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) (owner: Jdlrobson)
[20:38:03] <Jdlrobson>	 cjming: sounds good
[20:38:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:38:27] <wikibugs>	 (Merged) jenkins-bot: Icons for special projects [mediawiki-config] - https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) (owner: Jdlrobson)
[20:39:01] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:959872|Provide wordmarks/taglines for Wikibooks projects (T341251)]], [[gerrit:957908|Fix white background for Wikibooks wordmarks (T341251)]], [[gerrit:956502|Icons for special projects (T341242)]]
[20:39:11] <stashbot>	 T341242: Design: Get icons for Wikimedia special wikis (including chapters) - https://phabricator.wikimedia.org/T341242
[20:39:11] <stashbot>	 T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251
[20:39:26] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:39:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:02] <wikibugs>	 SRE-OnFire, Data-Platform-SRE, Discovery-Search, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (bking)
[20:50:21] <wikibugs>	 (CR) Fabfur: vanish: allow PURGE requests only from dedicated socket (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur)
[20:51:06] <wikibugs>	 (PS9) Fabfur: vanish: allow PURGE requests only from dedicated socket [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192)
[20:51:09] <logmsgbot>	 !log cjming@deploy2002 pikne and cjming and jdlrobson: Backport for [[gerrit:959872|Provide wordmarks/taglines for Wikibooks projects (T341251)]], [[gerrit:957908|Fix white background for Wikibooks wordmarks (T341251)]], [[gerrit:956502|Icons for special projects (T341242)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:51:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:51:19] <stashbot>	 T341242: Design: Get icons for Wikimedia special wikis (including chapters) - https://phabricator.wikimedia.org/T341242
[20:51:19] <cjming>	 Jdlrobson: are you able to test?
[20:51:19] <stashbot>	 T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251
[20:51:50] <Jdlrobson>	 cjming: yep looking now
[20:53:18] <Jdlrobson>	 @cjming LGTM! please sync!
[20:53:28] <cjming>	 yay - syncing
[20:53:34] <logmsgbot>	 !log cjming@deploy2002 pikne and cjming and jdlrobson: Continuing with sync
[20:54:09] <cjming>	 Jdlrobson: i'm assuming i need to purge the files in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/957908 ?
[20:58:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:59:15] <Jdlrobson>	 cjming: yep i believe so
[20:59:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T2100). Please do the needful.
[21:00:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:01:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:02:52] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:959872|Provide wordmarks/taglines for Wikibooks projects (T341251)]], [[gerrit:957908|Fix white background for Wikibooks wordmarks (T341251)]], [[gerrit:956502|Icons for special projects (T341242)]] (duration: 23m 50s)
[21:03:09] <stashbot>	 T341242: Design: Get icons for Wikimedia special wikis (including chapters) - https://phabricator.wikimedia.org/T341242
[21:03:10] <stashbot>	 T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251
[21:03:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:04:11] <cjming>	 Jdlrobson: ok - should be live - and i just purged the svgs
[21:04:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:04:41] <cjming>	 Superpes: if you're still around, i'll do yours now
[21:04:58] <wikibugs>	 (PS3) Clare Ming: [fiwiki] Add an editautoreviewprotected level protecion [mediawiki-config] - https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) (owner: Superpes15)
[21:08:32] <cjming>	 houseofm: Superpes: i'll hang out for a few minutes after which i'll close this backport window
[21:09:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:12:22] <cjming>	 !log end of UTC late backport window
[21:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:39] <dancy>	 I'm going to test some scap changes on the deploy server now.
[21:17:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[21:19:12] <wikibugs>	 (PS1) JHathaway: nginx: mount lib on tmpfs vol in cloud [puppet] - https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842)
[21:19:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt wdqs1017-20 - jclark@cumin1001"
[21:20:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt wdqs1017-20 - jclark@cumin1001"
[21:20:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:20:47] <wikibugs>	 (CR) JHathaway: [C: +2] nginx: add toggle for mounting lib on tmpfs vol (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[21:21:09] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[21:22:03] <logmsgbot>	 !log dancy@deploy2002 Started scap: testing scap mods
[21:24:08] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[21:25:24] <wikibugs>	 (CR) JHathaway: [C: +2] nginx: mount lib on tmpfs vol in cloud [puppet] - https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[21:27:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[21:29:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt wdqs1017-20 - jclark@cumin1001"
[21:29:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1022.mgmt.eqiad.wmnet with reboot policy FORCED
[21:29:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1017.mgmt.eqiad.wmnet with reboot policy FORCED
[21:29:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED
[21:29:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED
[21:30:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt wdqs1017-20 - jclark@cumin1001"
[21:30:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:30:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1018.mgmt.eqiad.wmnet with reboot policy FORCED
[21:30:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1019.mgmt.eqiad.wmnet with reboot policy FORCED
[21:31:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED
[21:31:29] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED
[21:31:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED
[21:32:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED
[21:32:41] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED
[21:33:08] <wikibugs>	 (PS19) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[21:35:52] <wikibugs>	 (PS20) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[21:36:43] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.62.0" for 598 hosts
[21:37:51] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.62.0" completed for 598 hosts
[21:38:32] <logmsgbot>	 !log dancy@deploy2002 Started scap: testing scap mods
[21:40:50] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:41:06] <wikibugs>	 (PS21) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[21:45:06] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:25] <logmsgbot>	 !log dancy@deploy2002 Started scap: testing scap mods
[21:46:33] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1018.mgmt.eqiad.wmnet with reboot policy FORCED
[21:46:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:46:52] <logmsgbot>	 !log dancy@deploy2002 Started scap: final test sync
[21:47:18] <wikibugs>	 (PS22) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[21:47:28] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[21:47:48] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED
[21:47:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED
[21:48:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[21:48:27] <wikibugs>	 (PS1) Ryan Kemper: elastic: don't alert p95 if request volume low [puppet] - https://gerrit.wikimedia.org/r/960712 (https://phabricator.wikimedia.org/T347341)
[21:48:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[21:49:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:49:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:50:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:51:09] <wikibugs>	 (PS23) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[21:51:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1019.mgmt.eqiad.wmnet with reboot policy FORCED
[21:51:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1017.mgmt.eqiad.wmnet with reboot policy FORCED
[21:52:12] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[21:52:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1022.mgmt.eqiad.wmnet with reboot policy FORCED
[21:52:31] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED
[21:52:42] <wikibugs>	 (PS24) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[21:53:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[21:53:41] <wikibugs>	 (CR) Bking: [C: +1] elastic: don't alert p95 if request volume low [puppet] - https://gerrit.wikimedia.org/r/960712 (https://phabricator.wikimedia.org/T347341) (owner: Ryan Kemper)
[21:54:00] <wikibugs>	 SRE-OnFire, Data-Platform-SRE, Discovery-Search, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (bking)
[21:54:38] <wikibugs>	 (PS25) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[21:56:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[21:57:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:57:49] <wikibugs>	 (CR) Ryan Kemper: [C: +2] elastic: don't alert p95 if request volume low [puppet] - https://gerrit.wikimedia.org/r/960712 (https://phabricator.wikimedia.org/T347341) (owner: Ryan Kemper)
[21:58:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1021.eqiad.wmnet']
[21:58:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022.eqiad.wmnet']
[21:58:37] <wikibugs>	 (PS1) Andrew Bogott: Update fake password keys for mysql::dump [labs/private] - https://gerrit.wikimedia.org/r/960713
[21:58:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1024.eqiad.wmnet']
[21:58:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022.eqiad.wmnet']
[21:59:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023.eqiad.wmnet']
[21:59:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018.eqiad.wmnet']
[21:59:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[21:59:52] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022.eqiad.wmnet']
[21:59:53] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022.eqiad.wmnet']
[22:00:33] <wikibugs>	 (CR) Andrew Bogott: [V: +2 C: +2] Update fake password keys for mysql::dump [labs/private] - https://gerrit.wikimedia.org/r/960713 (owner: Andrew Bogott)
[22:01:53] <logmsgbot>	 !log dancy@deploy2002 Finished scap: final test sync (duration: 15m 00s)
[22:02:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:03:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1019.eqiad.wmnet']
[22:04:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022']
[22:04:41] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022']
[22:05:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022']
[22:05:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022']
[22:07:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022']
[22:07:10] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022']
[22:07:26] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1024.eqiad.wmnet']
[22:07:43] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1021.eqiad.wmnet']
[22:07:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1018.eqiad.wmnet']
[22:07:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet']
[22:08:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[22:08:59] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[22:09:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023.eqiad.wmnet']
[22:11:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1019.eqiad.wmnet']
[22:11:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye
[22:11:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1019.eqiad.wmnet with OS bullseye
[22:11:49] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye
[22:11:52] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye
[22:13:34] <Superpes>	 cjming: Sorry, my internet completely died; I will reschedule it for tomorrow :/
[22:13:57] <Superpes>	 Many thanks for your availability btw :D
[22:13:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye
[22:14:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[22:14:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1022.eqiad.wmnet with OS bullseye
[22:14:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye
[22:14:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1023.eqiad.wmnet with OS bullseye
[22:14:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1024.eqiad.wmnet with OS bullseye
[22:14:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye
[22:14:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye
[22:15:16] <wikibugs>	 (PS1) Ryan Kemper: Revert "elastic: don't alert p95 if request volume low" [puppet] - https://gerrit.wikimedia.org/r/960728
[22:16:31] <wikibugs>	 (PS2) Ryan Kemper: Revert "elastic: don't alert p95 if request volume low" [puppet] - https://gerrit.wikimedia.org/r/960728 (https://phabricator.wikimedia.org/T347341)
[22:16:48] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] Revert "elastic: don't alert p95 if request volume low" [puppet] - https://gerrit.wikimedia.org/r/960728 (https://phabricator.wikimedia.org/T347341) (owner: Ryan Kemper)
[22:20:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:21:11] <wikibugs>	 (PS26) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[22:22:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:23:15] <wikibugs>	 (PS1) Ryan Kemper: elastic: don't alert p95 if request volume low [puppet] - https://gerrit.wikimedia.org/r/960717 (https://phabricator.wikimedia.org/T347341)
[22:23:44] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:24:38] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] elastic: don't alert p95 if request volume low [puppet] - https://gerrit.wikimedia.org/r/960717 (https://phabricator.wikimedia.org/T347341) (owner: Ryan Kemper)
[22:25:42] <wikibugs>	 (PS27) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[22:35:17] <wikibugs>	 (PS28) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[22:35:44] <wikibugs>	 (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott)
[22:37:51] <wikibugs>	 (PS29) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[22:39:02] <wikibugs>	 (PS1) Ryan Kemper: elastic: standardize eqiad & codfw p95 metrics [puppet] - https://gerrit.wikimedia.org/r/960721 (https://phabricator.wikimedia.org/T347341)
[22:39:24] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:40:13] <wikibugs>	 (CR) Ebernhardson: [C: +1] elastic: standardize eqiad & codfw p95 metrics [puppet] - https://gerrit.wikimedia.org/r/960721 (https://phabricator.wikimedia.org/T347341) (owner: Ryan Kemper)
[22:40:25] <wikibugs>	 (CR) Ryan Kemper: [C: +2] elastic: standardize eqiad & codfw p95 metrics [puppet] - https://gerrit.wikimedia.org/r/960721 (https://phabricator.wikimedia.org/T347341) (owner: Ryan Kemper)
[22:40:50] <wikibugs>	 (PS30) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[22:40:57] <wikibugs>	 (PS1) Andrea Denisse: prometheus: Enable selective scraping for Prometheus [puppet] - https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656)
[22:46:13] <wikibugs>	 SRE-OnFire, Data-Platform-SRE, Discovery-Search, Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (RKemper)
[22:48:13] <wikibugs>	 (CR) Andrea Denisse: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: Andrea Denisse)
[22:49:43] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/958807/43582/" [puppet] - https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: Andrea Denisse)
[22:51:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:51:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:52:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:52:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:53:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:57:15] <wikibugs>	 (CR) Andrea Denisse: "Patch #958807 must be merged and applied on all hosts before merging and applying." [puppet] - https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656) (owner: Andrea Denisse)
[23:11:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:12:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:15:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:18:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:26:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:29:08] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[23:29:15] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**)   - Remove...
[23:31:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1018.eqiad.wmnet with OS bullseye
[23:31:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1019.eqiad.wmnet with OS bullseye
[23:32:02] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye executed with errors: - wdqs1018 (**FAIL**)   - Remove...
[23:32:06] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye executed with errors: - wdqs1019 (**FAIL**)   - Remove...
[23:33:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017']
[23:33:45] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1017']
[23:33:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017']
[23:34:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1021.eqiad.wmnet with OS bullseye
[23:34:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**...
[23:34:20] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1022.eqiad.wmnet with OS bullseye
[23:34:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye executed with errors: - wdqs1022 (**FAIL**...
[23:34:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1023.eqiad.wmnet with OS bullseye
[23:34:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors: - wdqs1023 (**FAIL**...
[23:34:32] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1024.eqiad.wmnet with OS bullseye
[23:34:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye executed with errors: - wdqs1024 (**FAIL**...
[23:34:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018']
[23:34:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1019']
[23:35:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1021']
[23:35:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:35:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:35:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1024']
[23:35:49] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1018']
[23:35:50] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1019']
[23:35:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:35:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:36:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:36:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:36:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:36:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:36:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018']
[23:37:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1019']
[23:37:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:37:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:37:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:37:43] <wikibugs>	 (PS31) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[23:37:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:38:08] <wikibugs>	 (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott)
[23:40:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:40:08] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:40:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:40:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:40:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:40:51] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1017']
[23:41:26] <wikibugs>	 (PS32) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[23:41:50] <wikibugs>	 (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott)
[23:42:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1021']
[23:42:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:42:48] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:42:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:42:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1024']
[23:43:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:43:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022']
[23:43:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:43:38] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:43:57] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[23:44:03] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[23:44:26] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1018']
[23:44:28] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1019']
[23:44:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye
[23:44:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:44:50] <wikibugs>	 (PS33) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[23:44:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1019.eqiad.wmnet with OS bullseye
[23:44:55] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye
[23:44:59] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye
[23:45:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1024.eqiad.wmnet with OS bullseye
[23:45:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye
[23:45:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye
[23:45:20] <wikibugs>	 (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott)
[23:45:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[23:45:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:45:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:48:52] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1022']
[23:49:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1022.eqiad.wmnet with OS bullseye
[23:49:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye
[23:50:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023']
[23:50:51] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023']
[23:51:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED
[23:55:22] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED
[23:56:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:57:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase