[00:22:29] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:28:09] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:35] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:34:23] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/959980 [00:38:34] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/959980 (owner: 10TrainBranchBot) [00:39:01] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [00:42:13] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:21] RECOVERY - Check systemd state on logstash2026 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:29] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:09] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [00:49:55] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:53:15] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/959980 (owner: 10TrainBranchBot) [01:19:07] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [01:31:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:33:37] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [01:48:57] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:50:49] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [01:51:47] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:57:27] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:58:53] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:07:31] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:08:44] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:25] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:23:44] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:30:45] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:35:07] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:38:37] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:38:44] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:05] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:40:45] (03PS2) 10DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951) [02:43:15] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API not returning latest page when queried title is a redirect - https://phabricator.wikimedia.org/T346579 (10Brycehughes) Ah ok. Thanks for checking. I suppose this can just sit open for a bit. I have a workaround, it just involves me hitting the API 2-3... [02:48:49] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:49:15] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:52:09] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:03:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:26:15] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:46:07] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [03:54:49] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [04:07:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [04:32:13] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less 
than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [04:42:37] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:44:03] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:45:04] (03PS2) 10KartikMistry: Update cxserver to 2023-09-13-074325-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045) [04:45:36] * kart_ updating cxserver. Minor changes. [04:49:10] OK. I'll hold this till tomorrow. Are mesh changes OK to deploy (seems already deployed in staging) godog ? 
[04:52:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [04:55:51] <_joe_> kart_: you really need to read ops@; there was an email from Janis explaining those changes are safe to deploy [04:56:41] <_joe_> kart_: that's where we announce such changes; we're ofc open to suggestions on how to make such communications stand out [04:57:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [04:58:34] (03PS1) 10Ilias Sarantopoulos: ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - 10https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) [04:58:36] My bad. It was a month back and I also conveyed that to team :/ [04:58:57] _joe_: sorry for noise. 
[04:59:09] <_joe_> kart_: np :P [04:59:43] <_joe_> I didn't realize it was almost a month ago, sheesh [05:00:09] That means we've not deployed cxserver since then :) [05:00:58] (03PS2) 10Ilias Sarantopoulos: ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - 10https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) [05:01:12] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-09-13-074325-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045) (owner: 10KartikMistry) [05:02:21] (03Merged) 10jenkins-bot: Update cxserver to 2023-09-13-074325-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/959156 (https://phabricator.wikimedia.org/T346045) (owner: 10KartikMistry) [05:08:05] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:08:31] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:12:30] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:13:03] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:22:28] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:22:56] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:23:23] !log Updated cxserver to 2023-09-13-074325-production (T346045) [05:23:44] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [05:24:22] !log Updated cxserver to 2023-09-13-074325-production (T346045) [05:24:33] hmm? 
[05:47:53] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:49:21] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:52:37] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:53:23] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:54:01] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:54:49] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter2003:9106 poolcounter - 
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:09] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 35008 [06:14:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 35008 [06:28:35] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (208821s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [06:38:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:46:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:46:47] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:46:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:48:15] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:48:19] (03CR) 10Ayounsi: [C: 03+2] Block inbound RAs on the routers [homer/public] - 10https://gerrit.wikimedia.org/r/959732 (https://phabricator.wikimedia.org/T334916) (owner: 10Ayounsi) [06:48:41] (03CR) 10Muehlenhoff: [C: 03+1] mcrouter: Specify missing CXXFLAGS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [06:48:49] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:49:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 2.959 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:50:21] (03Merged) 10jenkins-bot: Block inbound RAs on the routers [homer/public] - 10https://gerrit.wikimedia.org/r/959732 (https://phabricator.wikimedia.org/T334916) (owner: 10Ayounsi) [06:50:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:55:50] (03CR) 10Majavah: [C: 03+2] hieradata: drop dmz_cidr excemptions for cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [06:55:54] (03CR) 10Muehlenhoff: [C: 03+2] firewall: Default provider to none [puppet] - 10https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:56:24] taavi: I'll merge your change along, ok? 
[06:56:27] yes please [06:56:53] ack, merged now [07:00:06] (03CR) 10Muehlenhoff: [C: 03+2] LVS: Set profile::firewall::provider: none [puppet] - 10https://gerrit.wikimedia.org/r/959954 (owner: 10Muehlenhoff) [07:00:06] Amir1, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T0700). [07:00:06] Sohom_Datta: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:39] (03PS3) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179 [07:00:47] o/ [07:02:23] o/ I can deploy [07:03:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/PageTriage] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959986 (https://phabricator.wikimedia.org/T345496) (owner: 10Sohom Datta) [07:04:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [07:06:10] !log roll out "Block inbound RAs on the routers" - T334916 [07:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:17] T334916: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 [07:07:45] (03PS4) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [07:07:47] (03PS2) 10Giuseppe Lavagetto: thumbor: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959948 (https://phabricator.wikimedia.org/T343025) [07:07:49] (03PS1) 10Giuseppe Lavagetto: apertium: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025) 
[07:08:45] (03CR) 10CI reject: [V: 04-1] apertium: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [07:15:28] (03Merged) 10jenkins-bot: Make sure different key values are handled while submitting [extensions/PageTriage] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959986 (https://phabricator.wikimedia.org/T345496) (owner: 10Sohom Datta) [07:16:30] !log taavi@deploy2002 Started scap: Backport for [[gerrit:959986|Make sure different key values are handled while submitting (T345496)]] [07:16:39] T345496: If a user tries to place two of the same tag, should show a warning or silently delete one tag - https://phabricator.wikimedia.org/T345496 [07:20:27] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [07:20:34] (03CR) 10Muehlenhoff: puppetdb: preseed to avoid creating database users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [07:22:21] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[07:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:26:52] (03PS1) 10Elukey: Add nodejs 18 images on Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/960544 [07:27:46] (03PS2) 10Elukey: Add nodejs 18 images on Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/960544 [07:29:25] !log taavi@deploy2002 taavi and soda: Backport for [[gerrit:959986|Make sure different key values are handled while submitting (T345496)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:29:29] finally [07:29:32] Sohom_Datta: ^ please test [07:29:32] T345496: If a user tries to place two of the same tag, should show a warning or silently delete one tag - https://phabricator.wikimedia.org/T345496 [07:31:05] On it :) [07:35:06] (03PS1) 10Urbanecm: growth: Enable section-image recommendations on 10 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960545 (https://phabricator.wikimedia.org/T345940) [07:35:52] (03CR) 10Effie Mouzeli: [C: 03+2] k8s: Fix dependencies for resources requiring kube user [puppet] - 10https://gerrit.wikimedia.org/r/959722 (owner: 10JMeybohm) [07:36:23] (03CR) 10Majavah: [C: 03+2] cr-cloud: Drop cloudmetrics excemptions [homer/public] - 10https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [07:36:26] Sohom_Datta looks ok to me : https://en.wikipedia.org/wiki/Emile_van_Rouveroy_van_Nieuwaal [07:37:02] (03Merged) 10jenkins-bot: cr-cloud: Drop cloudmetrics excemptions [homer/public] - 10https://gerrit.wikimedia.org/r/960027 
(https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [07:37:22] !log update eqsin-ulsfo transport link ospf metrics to match the new latency of 175ms [07:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:45] taavi: Looks good :) per MPGuy [07:37:57] thanks, syncing [07:37:59] !log taavi@deploy2002 taavi and soda: Continuing with sync [07:38:43] (03CR) 10JMeybohm: [V: 03+1] prometheus::k8s: Discover calico-felix targets from k8s api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [07:40:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/960544 (owner: 10Elukey) [07:40:19] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [07:44:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:46:42] (03CR) 10Effie Mouzeli: [C: 03+2] Add the configuration for the new wikikube hosts in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [07:47:15] (03Merged) 10jenkins-bot: Add the configuration for the new wikikube hosts in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [07:47:26] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:959986|Make sure different key values are handled while submitting (T345496)]] (duration: 30m 55s) [07:47:33] T345496: If a user tries to place two of the same tag, should show a warning or 
silently delete one tag - https://phabricator.wikimedia.org/T345496 [07:47:40] Sohom_Datta: MPGuy2824: your change is now live [07:48:23] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:49:08] !log drop cloudmetrics exceptions from cr firewall ACLs https://gerrit.wikimedia.org/r/c/operations/homer/public/+/960027 T326266 [07:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:14] T326266: Remove the WMCS statsd/Graphite service - https://phabricator.wikimedia.org/T326266 [07:49:23] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:49:35] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:50:34] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:50:49] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:51:17] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:53:01] taavi danke [07:53:15] Thank you :) [07:58:34] (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:58:49] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:00:04] (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:01:13] !log cordoning kubernetes2010 [08:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:49] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:07:59] jouncebot: next [08:07:59] In 1 hour(s) and 52 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1000) [08:08:04] jouncebot: nw 
[08:08:07] jouncebot: now [08:08:08] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [08:11:57] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10ayounsi) 05Open→03Resolved a:03ayounsi Deployed [08:14:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [08:18:09] (03CR) 10MVernon: [C: 03+1] thanos: remove thanos components from thanos::frontend role [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [08:18:53] (PuppetDisabled) firing: (2) Puppet disabled on puppetdb1002:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [08:19:03] (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:21:42] (03CR) 10JMeybohm: [C: 03+1] kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [08:22:46] (03PS1) 10Muehlenhoff: Mark mediawiki-testers as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/960546 (https://phabricator.wikimedia.org/T276465) [08:22:55] (03CR) 10Effie Mouzeli: [C: 03+2] kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [08:24:40] (03PS1) 10Muehlenhoff: Mark pentesters as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/960547 
(https://phabricator.wikimedia.org/T276465) [08:27:44] !log draining kubernetes2010.codfw.wmnet - T347267 [08:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:51] T347267: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 [08:28:18] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add nodejs 18 images on Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/960544 (owner: 10Elukey) [08:28:47] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:30:15] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:30:38] (03PS1) 10Muehlenhoff: Remove traceback-roots [puppet] - 10https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465) [08:31:20] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10MoritzMuehlenhoff) [08:39:03] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: host is down [08:39:19] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: host is down [08:43:16] !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.* [08:43:44] !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.* - T347267 [08:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:43:51] T347267: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 [08:44:03] 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10JMeybohm) Hey DC-Ops, could you please check on kubernetes2010 [08:45:07] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:46:35] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:48:08] (03PS1) 10Effie Mouzeli: site.pp: fix typo for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/960549 [08:48:51] (03CR) 10JMeybohm: [C: 03+1] site.pp: fix typo for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/960549 (owner: 10Effie Mouzeli) [08:48:59] (03PS3) 10Elukey: profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) [08:49:05] (03CR) 10Effie Mouzeli: [C: 03+2] site.pp: fix typo for new kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/960549 (owner: 10Effie Mouzeli) [08:53:52] vgutierrez: hey, from traffic side: Is it fine to just merge this patch? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/959762 [08:54:10] or should I do some dance like disabling puppet on lvs or etc [08:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:54:42] Amir1: in theory we should just run puppet on the cp nodes, or let it run and see traffic gradually migrates [08:55:12] (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) [08:55:16] I don't see anything ongoing for traffic on sal [08:55:36] yeah [08:55:57] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:56:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 16 hosts with reason: Schema change [08:56:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 16 hosts with reason: Schema change [08:57:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:57:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:57:21] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[08:57:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 15 hosts with reason: Maintenance [08:57:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 15 hosts with reason: Maintenance [08:57:42] (03PS4) 10Ladsgroup: profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [08:57:45] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:57:48] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [08:58:52] !log migrate ores.wikimedia.org's ATS backend to ores-legacy.discovery.wmnet (k8s app) - This will drain traffic to ORES bare metal nodes - T341696 [08:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:59] T341696: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 [08:59:01] Amir1: logged --^ [08:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:59:26] !log by the power vested in my be Chris Albon and ML team, I now pronounce ORES dead. 
[08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:42] elukey: That's logging ^ :P [08:59:52] :D :D :D :D [09:00:13] oh, wow [09:00:19] is that true? [09:01:03] the bare metal infra is depooled now, calls still go to lift wing via an adapter service [09:01:27] but plug of that will be also pulled eventually [09:01:28] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10ayounsi) This might need to be rolled back the day we start doing BGP unnumbered between spine and leaf as it seems to rely on it: https://www.theasciiconstruct.com/post/junos-b... [09:01:31] jynus: we have https://ores-legacy.wikimedia.org/ that is on k8s, and it calls lift wing behind the scenes [09:01:50] (03CR) 10Mabualruz: [C: 03+1] Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [09:02:11] jynus: the goal is also to deprecate ores-legacy once everybody is on Lift Wing [09:02:25] (we do it in two steps to drain ores bare metal and decom those servers) [09:03:23] if you need help decommissioning those server, you know who you gonna call elukey ? [09:04:05] the part I'm happy about is that mw support of ores already switched to LW directly without even needing to go through adapter [09:04:13] Amir1: you can remove the "if you need" part, you have to do it with me and Tobias :) [09:04:39] awesome. Just drop me the tickets :D [09:05:03] yep! 
Now I'll keep watching the ores-legacy dashboard for troubles, I know some will arise [09:06:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [09:06:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [09:06:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 13 hosts with reason: Maintenance [09:06:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 13 hosts with reason: Maintenance [09:08:34] (03PS3) 10Giuseppe Lavagetto: thumbor: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959948 (https://phabricator.wikimedia.org/T343025) [09:08:36] (03PS2) 10Giuseppe Lavagetto: apertium: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025) [09:08:52] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [09:09:48] (03CR) 10CI reject: [V: 04-1] apertium: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/960543 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [09:11:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [09:12:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [09:12:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 14 hosts with reason: Maintenance [09:12:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 14 hosts with 
reason: Maintenance [09:18:27] (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [09:18:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:18:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 14 hosts with reason: Maintenance [09:19:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 14 hosts with reason: Maintenance [09:19:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1030.eqiad.wmnet with OS bullseye [09:19:54] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1031.eqiad.wmnet with OS bullseye [09:20:03] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1032.eqiad.wmnet with OS bullseye [09:20:34] (KubernetesCalicoDown) firing: (6) kubernetes1028.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:20:47] (03CR) 10Btullis: "Looks good in general. Thanks brouberol. I left one genuine question about paths." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [09:22:35] (03CR) 10Jelto: [V: 03+1 C: 03+2] peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto) [09:22:47] (03CR) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [09:23:02] (03CR) 10Jelto: [C: 03+2] switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/959693 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto) [09:23:07] (03PS2) 10Jelto: switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/959693 (https://phabricator.wikimedia.org/T345618) [09:23:27] (PrometheusRuleEvaluationFailures) resolved: (8) Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [09:23:44] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [09:24:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [09:24:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [09:24:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 17 hosts with reason: Maintenance [09:24:50] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 17 hosts with reason: Maintenance [09:25:01] (03PS12) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) [09:25:34] (KubernetesCalicoDown) resolved: (6) kubernetes1028.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:26:13] (03CR) 10Ayounsi: [C: 03+1] Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [09:26:27] (03CR) 10Brouberol: "@" [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [09:27:54] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:28:49] Amir1: sorry...day off here. Nope, just merge it and puppet will take care [09:29:01] awesome. thanks. 
[09:29:19] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43508/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [09:30:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:30:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:30:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db[1137,1216,1220,1225].eqiad.wmnet,dbstore1005.eqiad.wmnet with reason: Maintenance [09:30:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[1137,1216,1220,1225].eqiad.wmnet,dbstore1005.eqiad.wmnet with reason: Maintenance [09:33:34] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage [09:33:52] (03CR) 10EoghanGaffney: [C: 03+1] switch peopleweb from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/959693 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto) [09:33:59] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage [09:34:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage [09:34:17] (03CR) 10EoghanGaffney: [C: 03+1] peopleweb: switch rsync source and dest between eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/959690 (https://phabricator.wikimedia.org/T345618) (owner: 10Jelto) [09:36:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1030.eqiad.wmnet with reason: host reimage [09:37:16] (03CR) 10Ayounsi: Support configuration of EVPN 
anycast GW on switches (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [09:38:36] !log switch people.wikimedia.org to codfw - T345618 [09:38:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1032.eqiad.wmnet with reason: host reimage [09:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:42] T345618: Switchover people.wikimedia.org - September 2023 - https://phabricator.wikimedia.org/T345618 [09:41:21] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1031.eqiad.wmnet with reason: host reimage [09:43:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [09:43:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [09:43:27] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=73525cca-1535-4d44-89d8-fcd584ea67a9) set by jmm@cumin2002 for... [09:43:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [09:43:33] (03CR) 10Volans: "Thanks for migrating to the batch classes, the approach looks sane, few suggestions inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [09:43:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [09:43:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=69921077-8a56-48de-9905-0d3d1b91d292) set by jmm@cumin2002 for... [09:45:31] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Jelto) [09:45:44] (03CR) 10Hashar: [C: 03+1] "With tox v3, I have confirmed there is no change in the configuration (using `tox --showconf`) and all environments have `usedevelop`:" [software/conftool] - 10https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238) (owner: 10Hashar) [09:46:33] (03PS1) 10Urbanecm: AddImageFeedbackHandler: Add missing parameters [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959993 (https://phabricator.wikimedia.org/T346277) [09:47:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance [09:47:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance [09:47:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es[1021-1022].eqiad.wmnet with reason: Maintenance [09:47:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1021-1022].eqiad.wmnet with reason: Maintenance [09:49:27] 
(PrometheusRuleEvaluationFailures) firing: (4) Prometheus rule evaluation failures (instance prometheus1006:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [09:50:33] jouncebot: nowandnext [09:50:33] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [09:50:33] In 0 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1000) [09:51:28] (03PS2) 10JMeybohm: prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) [09:52:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance [09:52:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance [09:52:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1022 (T344589)', diff saved to https://phabricator.wikimedia.org/P52596 and previous config saved to /var/cache/conftool/dbconfig/20230925-095235-ladsgroup.json [09:53:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1030.eqiad.wmnet with OS bullseye [09:53:58] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:54:04] (03CR) 10JMeybohm: [C: 03+2] prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [09:54:27] (PrometheusRuleEvaluationFailures) resolved: (7) Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures [09:55:54] (03PS1) 10Mhorsey: Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) [09:56:30] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:56:44] (03CR) 10CI reject: [V: 04-1] Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) (owner: 10Mhorsey) [09:56:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1032.eqiad.wmnet with OS bullseye [09:58:38] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.3 - https://phabricator.wikimedia.org/T316421 (10LSobanski) [09:59:36] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.3 - https://phabricator.wikimedia.org/T316421 (10LSobanski) I updated the description to reflect the new Etherpad release (1.9.2). See below for a list of changes: * Compability changes ** express-rate-limit has be... 
[09:59:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1031.eqiad.wmnet with OS bullseye [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1000) [10:00:07] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/959981 [10:01:59] (03PS2) 10Mhorsey: Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) [10:03:41] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1033.eqiad.wmnet with OS bullseye [10:03:52] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1034.eqiad.wmnet with OS bullseye [10:03:59] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1035.eqiad.wmnet with OS bullseye [10:04:05] (03PS13) 10Brouberol: [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) [10:04:08] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [10:04:15] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1037.eqiad.wmnet with OS bullseye [10:04:23] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye [10:04:31] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye [10:04:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1039.eqiad.wmnet with OS bullseye [10:04:45] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1040.eqiad.wmnet with OS bullseye [10:04:46] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1038.eqiad.wmnet with 
OS bullseye [10:04:49] (03CR) 10Brouberol: "Thanks for the review Volans! I tried to address your remarks, questions and nits!" [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [10:05:04] (03CR) 10Mhorsey: [C: 04-1] "DO NOT MERGE UNTIL RELEASE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) (owner: 10Mhorsey) [10:05:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1041.eqiad.wmnet with OS bullseye [10:05:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1042.eqiad.wmnet with OS bullseye [10:06:15] (03PS1) 10Filippo Giunchedi: o11y: add some leeway for PrometheusRuleEvaluationFailures [alerts] - 10https://gerrit.wikimedia.org/r/960560 (https://phabricator.wikimedia.org/T347167) [10:07:27] (03CR) 10CI reject: [V: 04-1] [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [10:08:10] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [10:08:18] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye [10:09:27] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye [10:09:35] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1038.eqiad.wmnet with OS bullseye [10:10:02] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: add some leeway for PrometheusRuleEvaluationFailures [alerts] - 10https://gerrit.wikimedia.org/r/960560 (https://phabricator.wikimedia.org/T347167) (owner: 10Filippo Giunchedi) [10:12:39] (03CR) 10Filippo Giunchedi: P:prometheus::ops: convert to using wmflib::get_clusters (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:17:34] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage [10:17:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage [10:17:50] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage [10:18:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T344589)', diff saved to https://phabricator.wikimedia.org/P52597 and previous config saved to /var/cache/conftool/dbconfig/20230925-101824-ladsgroup.json [10:18:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage [10:18:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage [10:18:50] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage [10:19:14] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage [10:19:18] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage [10:20:06] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1040.eqiad.wmnet with reason: host reimage [10:20:07] (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [10:20:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1033.eqiad.wmnet with reason: host reimage [10:22:33] !log jiji@cumin1001 
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1035.eqiad.wmnet with reason: host reimage [10:23:04] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1041.eqiad.wmnet with reason: host reimage [10:23:38] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1043.eqiad.wmnet with OS bullseye [10:23:52] (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [10:23:57] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1052.eqiad.wmnet with OS bullseye [10:24:19] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1044.eqiad.wmnet with OS bullseye [10:25:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1039.eqiad.wmnet with reason: host reimage [10:25:00] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1045.eqiad.wmnet with OS bullseye [10:25:03] (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [10:25:13] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1046.eqiad.wmnet with OS bullseye [10:25:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye [10:25:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1048.eqiad.wmnet with OS bullseye [10:25:55] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1047.eqiad.wmnet with OS bullseye [10:25:59] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1049.eqiad.wmnet with OS bullseye [10:26:13] !log 
jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1050.eqiad.wmnet with OS bullseye [10:26:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1051.eqiad.wmnet with OS bullseye [10:27:00] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1053.eqiad.wmnet with OS bullseye [10:27:09] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1054.eqiad.wmnet with OS bullseye [10:27:12] (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [10:27:20] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1056.eqiad.wmnet with OS bullseye [10:27:27] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1037.eqiad.wmnet with reason: host reimage [10:27:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1055.eqiad.wmnet with OS bullseye [10:27:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1042.eqiad.wmnet with reason: host reimage [10:27:45] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1034.eqiad.wmnet with reason: host reimage [10:29:01] (03CR) 10Cathal Mooney: [C: 03+2] Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [10:29:34] (03Merged) 10jenkins-bot: Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [10:29:55] (03CR) 10Urbanecm: [C: 04-2] "pending deployment date definition" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/960545 (https://phabricator.wikimedia.org/T345940) (owner: 10Urbanecm) [10:31:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye [10:33:10] PROBLEM - Host kubernetes1040 is DOWN: PING CRITICAL - Packet loss = 100% [10:33:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P52599 and previous config saved to /var/cache/conftool/dbconfig/20230925-103330-ladsgroup.json [10:33:36] (03PS1) 10Marostegui: Revert "db1128: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/959994 [10:34:15] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [10:34:21] (03CR) 10Marostegui: [C: 03+2] Revert "db1128: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/959994 (owner: 10Marostegui) [10:34:36] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye [10:34:38] effie: I'm a bit afraid that you're running them too closely between each other and many will fail to downtime and potentially other steps [10:34:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1128', diff saved to https://phabricator.wikimedia.org/P52600 and previous config saved to /var/cache/conftool/dbconfig/20230925-103454-root.json [10:35:34] (KubernetesCalicoDown) firing: kubernetes1035.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1035.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:35:48] RECOVERY - Host kubernetes1040 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [10:36:35] volans: I realised it quite late and the hard way [10:36:53] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host kubernetes1042.eqiad.wmnet with OS bullseye [10:37:13] each downtime during reimage requires a puppet run on the active alert host and each run take ~2.5 minutes [10:37:16] *takes [10:37:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1040.eqiad.wmnet with OS bullseye [10:37:38] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage [10:38:15] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage [10:38:50] volans: we will take the alert hit, I will prep a patch to add that info in the help info, as it gets forgotten all the time [10:38:54] sorry about that [10:38:56] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage [10:39:07] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:09] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage [10:39:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1041.eqiad.wmnet with OS bullseye [10:39:43] effie: thanks! 
FYI we'll shortly have locking support so we would be able to avoid some of those failures [10:40:02] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [10:40:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage [10:40:14] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage [10:40:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage [10:40:34] (KubernetesCalicoDown) firing: (2) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:40:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1043.eqiad.wmnet with reason: host reimage [10:40:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1035.eqiad.wmnet with OS bullseye [10:40:56] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host deploy1002.eqiad.wmnet [10:41:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1033.eqiad.wmnet with OS bullseye [10:42:03] 10SRE-swift-storage: Swift-recon -d overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (10MatthewVernon) 05Open→03Resolved Resolved by deploying `2.26.0-10+deb11u1+wmf1` fleet-wide. 
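[editor's note] The 10:34:38 and 10:37:13 messages explain why so many sre.hosts.downtime runs fail: each downtime needs a Puppet run on the active alert host, and one run takes about 2.5 minutes. A minimal sketch of that arithmetic follows, assuming (this is not stated in the log) that those Puppet runs cannot overlap, so simultaneous launches queue behind each other. The function names are hypothetical and are not part of the cookbooks; the hostnames are taken from the log only for illustration.

```python
from datetime import timedelta

# From the chat: one Puppet run on the active alert host takes ~2.5 minutes.
PUPPET_RUN = timedelta(minutes=2, seconds=30)


def staggered_offsets(hosts, spacing=PUPPET_RUN):
    """Return (host, delay) pairs so that launches are `spacing` apart."""
    return [(host, i * spacing) for i, host in enumerate(hosts)]


def worst_case_wait(n_concurrent, run_time=PUPPET_RUN):
    """Time until the last of n simultaneous downtimes finishes its Puppet run,
    assuming (hypothetically) the runs are strictly serialized."""
    return n_concurrent * run_time


if __name__ == "__main__":
    hosts = [f"kubernetes{n}.eqiad.wmnet" for n in range(1043, 1047)]
    for host, delay in staggered_offsets(hosts):
        print(f"{host}: start after {delay}")
    print(f"worst case for 10 at once: {worst_case_wait(10)}")
```

Under that assumption, ten reimages started together would leave the last one waiting roughly 25 minutes for its downtime, which fits the pattern of END (FAIL) downtime entries in this log; spacing the launches by one Puppet run keeps each wait short. The locking support mentioned at 10:39:43 would presumably address the same contention from the cookbook side.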
[10:42:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage [10:42:56] volans: <3 [10:42:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage [10:43:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1045.eqiad.wmnet with reason: host reimage [10:43:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1039.eqiad.wmnet with OS bullseye [10:43:24] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage [10:43:30] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage [10:43:31] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [10:45:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:45:06] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1049.eqiad.wmnet with reason: host reimage [10:45:22] (03PS1) 10Elukey: icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) [10:45:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1052.eqiad.wmnet with reason: host reimage [10:45:41] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage [10:46:35] !log jiji@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host kubernetes1037.eqiad.wmnet with OS bullseye [10:46:50] 10SRE, 10SRE-swift-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 (10MatthewVernon) 05Stalled→03Resolved a:03MatthewVernon We don't use swiftrepl any more, so closing this. [10:47:37] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1050.eqiad.wmnet with reason: host reimage [10:47:37] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1051.eqiad.wmnet with reason: host reimage [10:47:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage [10:48:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1055.eqiad.wmnet with reason: host reimage [10:48:15] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1044.eqiad.wmnet with reason: host reimage [10:48:17] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage [10:48:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1034.eqiad.wmnet with OS bullseye [10:48:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P52601 and previous config saved to /var/cache/conftool/dbconfig/20230925-104837-ladsgroup.json [10:49:08] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:09] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1046.eqiad.wmnet with reason: host reimage [10:49:27] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1002.eqiad.wmnet [10:50:08] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1048.eqiad.wmnet with reason: host reimage [10:50:10] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1053.eqiad.wmnet with reason: host reimage [10:50:34] (KubernetesCalicoDown) firing: (5) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:50:35] (03PS5) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [10:50:37] (03PS1) 10Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) [10:50:58] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:24] (03CR) 10CI reject: [V: 04-1] mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [10:51:26] (03CR) 10Volans: [sre.kafka] Use broker in-sync status as a gate between 
broker restarts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [10:51:39] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:52:06] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:52:08] keyholder error expected, fixing [10:52:12] PROBLEM - Check size of conntrack table on kubernetes1051 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.28: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:52:20] PROBLEM - DPKG on kubernetes1049 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.26: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:52:33] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage [10:52:34] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) > An interesting data point (that I didn't see directly in the other ticket, at least in a quick scan!) would be some idea of the curve of "i... 
[10:52:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage [10:53:25] PROBLEM - Check for large files in client bucket on kubernetes1046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.166: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [10:53:30] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1054.eqiad.wmnet with reason: host reimage [10:53:31] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1056.eqiad.wmnet with reason: host reimage [10:54:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1048 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.25: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:54:45] (03PS1) 10Elukey: Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) [10:54:49] PROBLEM - Check for large files in client bucket on kubernetes1051 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.28: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [10:54:55] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1051 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.28: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [10:54:56] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter1004.eqiad.wmnet [10:55:15] PROBLEM - confd service on kubernetes1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:55:19] PROBLEM - Check systemd state on kubernetes1051 is CRITICAL: 
CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:27] RECOVERY - Check size of conntrack table on kubernetes1051 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:55:34] (KubernetesCalicoDown) firing: (7) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:55:49] RECOVERY - Check for large files in client bucket on kubernetes1051 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [10:55:57] PROBLEM - Check systemd state on kubernetes1046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.166: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:17] RECOVERY - confd service on kubernetes1044 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:56:23] RECOVERY - Check systemd state on kubernetes1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:30] 10SRE-swift-storage: flip/flop mounting filesystems between systemd and swift-drive-audit - https://phabricator.wikimedia.org/T265450 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Fixed with roll-out of swift version `2.26.0-10+deb11u1+wmf1` fleet-wide. 
[10:56:39] (KeyholderUnarmed) resolved: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:56:39] RECOVERY - Check for large files in client bucket on kubernetes1046 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [10:57:01] PROBLEM - Check systemd state on kubernetes1048 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.25: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:11] PROBLEM - Check size of conntrack table on kubernetes1048 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.25. Check system logs on 10.64.132.25 https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:57:16] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1051.eqiad.wmnet with OS bullseye [10:57:21] PROBLEM - Host kubernetes1049 is DOWN: PING CRITICAL - Packet loss = 100% [10:57:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage [10:57:55] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1055.eqiad.wmnet with OS bullseye [10:57:59] RECOVERY - Check systemd state on kubernetes1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1045.eqiad.wmnet with OS bullseye [10:58:11] RECOVERY - Check size of conntrack table on kubernetes1048 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:58:17] (03PS6) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 
(https://phabricator.wikimedia.org/T343025) [10:58:19] (03PS2) 10Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) [10:58:37] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1004.eqiad.wmnet [10:58:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1043.eqiad.wmnet with OS bullseye [10:59:06] (03CR) 10CI reject: [V: 04-1] mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [10:59:07] RECOVERY - Check systemd state on kubernetes1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:25] PROBLEM - Check systemd state on kubernetes1047 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.132.29. 
Check system logs on 10.64.132.29 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:37] RECOVERY - Host kubernetes1049 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [11:00:19] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:27] (03PS3) 10Kamila Součková: geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) [11:00:28] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter1005.eqiad.wmnet [11:00:34] (KubernetesCalicoDown) firing: (6) kubernetes1034.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:01:05] PROBLEM - Host kubernetes1044 is DOWN: PING CRITICAL - Packet loss = 100% [11:01:05] PROBLEM - Check for large files in client bucket on kubernetes1056 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.136.27: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [11:01:09] PROBLEM - Host kubernetes1050 is DOWN: PING CRITICAL - Packet loss = 100% [11:01:51] PROBLEM - Host kubernetes1048 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:15] RECOVERY - Check for large files in client bucket on kubernetes1056 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [11:02:35] RECOVERY - Check systemd state on kubernetes1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:53] PROBLEM - Check whether ferm is active by checking the default input chain on 
kubernetes1049 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:02:53] RECOVERY - DPKG on kubernetes1049 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:03:09] PROBLEM - Host kubernetes1046 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:16] (03PS2) 10Hnowlan: trafficserver: route knowledge-gap path via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/946928 (https://phabricator.wikimedia.org/T342213) [11:03:16] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1056.eqiad.wmnet with OS bullseye [11:03:25] RECOVERY - Host kubernetes1050 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [11:03:40] (03PS2) 10Hnowlan: trafficserver: route requests to mediarequests service [puppet] - 10https://gerrit.wikimedia.org/r/956909 (https://phabricator.wikimedia.org/T336380) [11:03:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T344589)', diff saved to https://phabricator.wikimedia.org/P52602 and previous config saved to /var/cache/conftool/dbconfig/20230925-110343-ladsgroup.json [11:04:01] RECOVERY - Host kubernetes1044 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:04:33] RECOVERY - Host kubernetes1048 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [11:04:46] (03PS2) 10Hnowlan: api-gateway: emit cache-control header for 404s [deployment-charts] - 10https://gerrit.wikimedia.org/r/956833 (https://phabricator.wikimedia.org/T336400) [11:05:13] PROBLEM - Check systemd state on kubernetes1049 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:15] PROBLEM - Host kubernetes1053 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:19] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST customresourcedefinitions) on 
k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:24] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10LSobanski) [11:05:25] RECOVERY - Host kubernetes1046 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:05:30] (03PS1) 10Majavah: hieradata: update ns0.openstack address [puppet] - 10https://gerrit.wikimedia.org/r/960570 [11:05:34] (KubernetesCalicoDown) firing: (5) kubernetes1050.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:05:34] 10SRE, 10Traffic: Add README and build-specific Dockerfile to purged - https://phabricator.wikimedia.org/T347021 (10LSobanski) [11:05:49] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1048 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:05:52] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1005.eqiad.wmnet [11:06:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1048.eqiad.wmnet with OS bullseye [11:06:27] PROBLEM - Host kubernetes1047 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:23] RECOVERY - Host kubernetes1053 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:07:29] 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 10Wikimedia-Performance-recommendation, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10MatthewVernon) I can confirm that we're running a new-enough swift version everywhere that we //could//... 
[11:07:33] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1044.eqiad.wmnet with OS bullseye [11:08:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1049.eqiad.wmnet with OS bullseye [11:08:33] RECOVERY - Host kubernetes1047 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [11:08:35] PROBLEM - Host kubernetes1054 is DOWN: PING CRITICAL - Packet loss = 100% [11:08:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1052.eqiad.wmnet with OS bullseye [11:09:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1046.eqiad.wmnet with OS bullseye [11:09:35] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1050 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:09:35] PROBLEM - Check systemd state on kubernetes1050 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:34] (KubernetesCalicoDown) resolved: (5) kubernetes1050.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:10:45] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:10:45] PROBLEM - Check systemd state on kubernetes1047 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:49] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1053 is 
CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:11:22] (03PS1) 10Hnowlan: rest-gateway: only pass requests for knowledge-gap on wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/960575 (https://phabricator.wikimedia.org/T342213) [11:11:27] (03PS1) 10Majavah: hieradata: remove more cloudmetrics1003 references [puppet] - 10https://gerrit.wikimedia.org/r/960576 [11:11:47] RECOVERY - Host kubernetes1054 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [11:12:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1050.eqiad.wmnet with OS bullseye [11:13:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1038.eqiad.wmnet with OS bullseye [11:13:26] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet [11:13:27] PROBLEM - Check systemd state on kubernetes1053 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:31] (03PS2) 10Majavah: hieradata: remove more cloudmetrics1003 references [puppet] - 10https://gerrit.wikimedia.org/r/960576 (https://phabricator.wikimedia.org/T326266) [11:14:57] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:16:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1047.eqiad.wmnet with OS bullseye [11:16:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1053.eqiad.wmnet with OS bullseye [11:16:25] PROBLEM - Check systemd state on kubernetes1054 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:47] !log 
jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1036.eqiad.wmnet with OS bullseye [11:17:47] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1054 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:19:25] 10SRE, 10ExternalGuidance, 10Language-Team, 10Traffic-Icebox: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10Pginer-WMF) 05Open→03Resolved a:03Pginer-WMF I think the task can be closed, and focus future efforts in {T280430} The new Vector skin... [11:19:58] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1014.eqiad.wmnet [11:20:17] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1013.eqiad.wmnet [11:20:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1054.eqiad.wmnet with OS bullseye [11:25:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1013.eqiad.wmnet [11:26:46] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [11:33:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [11:33:50] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for 30 hosts [11:34:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 30 hosts [11:35:54] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1052 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:36:24] RECOVERY - Check systemd state on kubernetes1047 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:24] RECOVERY - Check systemd state on kubernetes1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:24] RECOVERY - Check systemd state on kubernetes1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:24] RECOVERY - Check systemd state on kubernetes1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:24] RECOVERY - Check systemd state on kubernetes1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:33] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) Thanks for the advice @joe! > What I fail to understand is how, if this was an open file l... 
[11:36:46] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:36:46] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1049 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:36:46] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1050 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:36:46] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1052 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:36:46] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:36:47] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1054 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:37:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet [11:38:14] (03PS3) 10JMeybohm: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [11:40:53] (03CR) 10JMeybohm: [C: 03+2] conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [11:41:01] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::node: Reserve CPU resources for system daemons [puppet] - 10https://gerrit.wikimedia.org/r/959164 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm) [11:41:07] (ProbeDown) firing: Service docker-registry:443 
has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:41:31] <_joe_> uh [11:41:37] ^ is this maintenance? [11:41:40] I will ack [11:41:48] here too [11:41:52] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:41:53] checking [11:41:55] <_joe_> jynus: not that i know of [11:42:08] !incidents [11:42:12] <_joe_> is it eqiad? [11:42:23] yeah, eqiad and recovering [11:42:31] it was acked [11:42:39] ip4 eqiad [11:42:43] !log running puppet on lvs in codfw - T346714 [11:42:48] <_joe_> yeah it's probably due to the rdb server reboot [11:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:51] T346714: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 [11:43:04] <_joe_> jayme: I guess wrong DC? [11:43:10] !log running puppet on lvs in eqiad - T346714 (TYPO from above, did not run in codfw) [11:43:14] nope :) [11:43:15] (03CR) 10Ladsgroup: "would it make sense to remove the probe as well? 
to remove the health checks" [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [11:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet [11:44:10] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:45:05] I'll take a look at the logs [11:45:51] Yeah docker is probably the rdb reboot [11:46:07] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:46:15] resolved, nice [11:46:26] Yeah, done with that pair of reboots [11:46:36] I guess it'll fail again when I do the same pair in codfw [11:47:26] back in a nit [11:47:28] bit [11:50:36] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [11:50:37] correct me if I am wrong, registry should only impact mw image rebuilds and things like that, no direct user impact? [11:50:50] e.g. deploys, right? 
[11:51:01] (03PS1) 10JMeybohm: mw-api-ext/mw-web: Raise main replicas to 16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/960591 [11:52:15] (03CR) 10Filippo Giunchedi: icinga/nagios: remove check_ores* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [11:52:40] jynus: it's accessible from external as well, so one could say there might be user impact [11:52:51] I see, thanks [11:54:08] not complaining about it alerting, just tried to assess impact - sometimes some dependencies create greater fallout than initially expected [11:56:26] jynus: for context, apparently docker-registry doesn't do redis ha [11:56:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/960576 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [11:57:08] (03CR) 10JMeybohm: [C: 03+2] mw-api-ext/mw-web: Raise main replicas to 16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/960591 (owner: 10JMeybohm) [11:57:24] jynus: Do you want me to downtime it before rebooting its pair in codfw? [11:57:54] (03Merged) 10jenkins-bot: mw-api-ext/mw-web: Raise main replicas to 16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/960591 (owner: 10JMeybohm) [11:58:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10JMeybohm) [11:58:52] claime: no worries on my side, if I know it is going to happen - just be quick to ack on splunk, so it doesn't p* everybody :-D [11:59:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (10JMeybohm) [11:59:17] jynus: ack, sorry for the bother [11:59:39] no issues caused, alerts are there to happen! 
[12:00:13] I'll do that after my lunch [12:00:17] :) [12:01:01] (03CR) 10Majavah: [C: 03+2] hieradata: remove more cloudmetrics1003 references [puppet] - 10https://gerrit.wikimedia.org/r/960576 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [12:01:39] jouncebot: nowandnext [12:01:39] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [12:01:39] In 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1300) [12:01:55] I am more interested in seeing non-useful alerts downtimed, like those for new hosts that are WIP or under long-running maintenance, which only add noise to observability (while the docker-registry one was a real issue) [12:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:02:35] (03PS1) 10Hnowlan: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/960596 [12:03:07] and sometimes it is useful to see things go down and up correctly! 
[12:07:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:08:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:08:16] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [12:10:33] (03PS1) 10Filippo Giunchedi: pontoon: update o11y rolemap [puppet] - 10https://gerrit.wikimedia.org/r/960599 [12:11:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [12:12:29] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: update o11y rolemap [puppet] - 10https://gerrit.wikimedia.org/r/960599 (owner: 10Filippo Giunchedi) [12:12:48] (03PS1) 10Jbond: puppet_compiler: roll back to 2.5.6 [puppet] - 10https://gerrit.wikimedia.org/r/960600 (https://phabricator.wikimedia.org/T346216) [12:13:52] (03CR) 10Jbond: [C: 03+2] puppet_compiler: roll back to 2.5.6 [puppet] - 10https://gerrit.wikimedia.org/r/960600 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [12:14:25] godog: happy for me to merge yours [12:14:33] jbond: oops! 
yes thank you, I forgot [12:14:39] done [12:14:44] cheers [12:16:16] !log jayme@deploy2002 Started scap: (no justification provided) [12:16:30] (03PS1) 10Majavah: icinga: Drop monitoring for *.wmcloud.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/960601 (https://phabricator.wikimedia.org/T345983) [12:16:32] (03PS1) 10Majavah: nagios_common: drop unused contact group [puppet] - 10https://gerrit.wikimedia.org/r/960602 [12:17:13] !log bumping k8s deployment mw-web and mw-api-ext to 16 replicas each in both DCs [12:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:53] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): deomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (10jbond) [12:18:59] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (10jbond) [12:19:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:20:00] (03PS1) 10Muehlenhoff: dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960603 [12:20:24] (03CR) 10CI reject: [V: 04-1] dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960603 (owner: 10Muehlenhoff) [12:22:50] 10ops-eqiad, 10DC-Ops: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10BTullis) [12:22:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: update ns0.openstack address [puppet] - 10https://gerrit.wikimedia.org/r/960570 (owner: 10Majavah) [12:23:08] (03CR) 10Majavah: [C: 03+2] hieradata: update ns0.openstack address [puppet] - 10https://gerrit.wikimedia.org/r/960570 (owner: 10Majavah) [12:23:32] (03CR) 10JMeybohm: [C: 03+1] 
mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/960596 (owner: 10Hnowlan) [12:23:52] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10BTullis) [12:24:08] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes1012.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:24:12] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10BTullis) p:05Triage→03Medium [12:25:27] (03PS2) 10FNegri: Package for Debian Bookworm [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) [12:25:29] (03PS4) 10FNegri: d/changelog: bump version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959316 (owner: 10David Caro) [12:26:25] !log jayme@deploy2002 Finished scap: (no justification provided) (duration: 10m 08s) [12:26:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes1022.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:28:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:49] (03PS2) 10Muehlenhoff: dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960603 [12:29:43] (03CR) 10JMeybohm: [C: 03+1] otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [12:29:52] (03CR) 10Muehlenhoff: [C: 03+1] "Didn't test a build, but looks good to me in general" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri) [12:30:07] (ProbeDown) firing: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [12:33:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:33:54] (03CR) 10Hashar: python-build: provide a python2 Bullseye image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [12:35:07] (ProbeDown) resolved: (2) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:36:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960603 (owner: 10Muehlenhoff) [12:38:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/960162 (owner: 10Majavah) [12:38:50] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack::pdns::recusor: cleanup remains from old setup [puppet] - 10https://gerrit.wikimedia.org/r/960162 (owner: 10Majavah) [12:41:37] (ProbeDown) firing: (3) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:41:50] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:18] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance [12:45:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance [12:45:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es[1024-1025].eqiad.wmnet with reason: Maintenance [12:45:51] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno) [12:45:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1024-1025].eqiad.wmnet with reason: Maintenance [12:46:37] (ProbeDown) firing: (3) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:47:28] (03CR) 10Ssingh: [C: 03+1] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:47:51] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm) [12:50:35] (03CR) 10Kamila Součková: [C: 03+2] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:52:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [12:52:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [12:52:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P52603 and previous config saved to /var/cache/conftool/dbconfig/20230925-125212-ladsgroup.json [12:53:29] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:54:33] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:56:21] !log put codfw before eqiad in geoDNS defaults [12:56:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:37] (ProbeDown) resolved: (2) Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:39] (03CR) 10Muehlenhoff: python-build: provide a python2 Bullseye image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [12:58:02] jouncebot: nowandnext [12:58:02] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [12:58:02] In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1300) [12:58:10] (03PS1) 10Aqu: Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) [12:58:13] (03CR) 10Urbanecm: [C: 03+2] AddImageFeedbackHandler: Add missing parameters [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959993 (https://phabricator.wikimedia.org/T346277) (owner: 10Urbanecm) [12:58:19] (03CR) 10Urbanecm: [C: 03+2] listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm) [12:58:46] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper RA receive bug CVE-2023-28981 - https://phabricator.wikimedia.org/T334916 (10cmooney) Hmm yeah good point. We can probably upgrade devices to a release with the fix in it before then. [12:59:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/960601 (https://phabricator.wikimedia.org/T345983) (owner: 10Majavah) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1300) [13:00:05] sergi0, ihurbain, houseofm, and Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] (ProbeDown) firing: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:13] I can deploy today [13:00:17] hi all! [13:00:20] i'm around :) [13:00:22] o/ [13:00:23] (03CR) 10Filippo Giunchedi: [C: 03+1] nagios_common: drop unused contact group [puppet] - 10https://gerrit.wikimedia.org/r/960602 (owner: 10Majavah) [13:00:25] i! [13:00:26] hi! 
[13:00:32] (03CR) 10Majavah: [C: 03+2] icinga: Drop monitoring for *.wmcloud.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/960601 (https://phabricator.wikimedia.org/T345983) (owner: 10Majavah) [13:00:39] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: enable AddLink backend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno) [13:00:43] (03CR) 10Majavah: [C: 03+2] nagios_common: drop unused contact group [puppet] - 10https://gerrit.wikimedia.org/r/960602 (owner: 10Majavah) [13:01:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno) [13:01:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwmaint1002.eqiad.wmnet [13:01:26] (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink backend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) (owner: 10Sergio Gimeno) [13:01:28] query houseofm [13:01:31] eh [13:01:36] o/ [13:01:42] hi HouseOfM! [13:01:47] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:960074|GrowthExperiments: enable AddLink backend 14th round of wikis (T308139)]] [13:02:00] T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 [13:02:03] (03CR) 10CDanis: [C: 03+1] otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [13:02:34] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/959950 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [13:03:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 59278 [13:03:38] !log cordoned kubernetes10[27-56] [13:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:13] !log installing openjdk-11 security updates on buster [13:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:36] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:04:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P52604 and previous config saved to /var/cache/conftool/dbconfig/20230925-130444-ladsgroup.json [13:05:07] (ProbeDown) resolved: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:37] (ProbeDown) firing: (2) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks okay" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [13:05:42] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:08:04] (03PS1) 10Majavah: P:openstack::galera: drop nrpe process check [puppet] - 10https://gerrit.wikimedia.org/r/960612 (https://phabricator.wikimedia.org/T345294) [13:08:26] (03CR) 10CI reject: [V: 04-1] listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm) [13:08:32] still preparing the k8s image for deployment... [13:08:44] (JobUnavailable) firing: (6) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:09:18] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwmaint1002.eqiad.wmnet [13:10:26] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:10:37] (ProbeDown) resolved: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-web:4450 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:44] ehm... i got kubernetes2010.codfw.wmnet port 22: Connection timed out during a scap deployment [13:10:49] (03CR) 10Ayounsi: [C: 03+1] "Noted, thanks, still a bit blurry to me but overall it makes sens!" [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [13:10:52] what's happening? 
[13:11:01] <_joe_> urbanecm: it's ok, that server is down [13:11:19] _joe_: shouldn't scap ignore it then? [13:11:36] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:11:38] <_joe_> but in theory it should disappear from there yes unless I did something asinine with it :) [13:11:44] <_joe_> urbanecm: did that make scap fail? [13:11:47] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1010.eqiad.wmnet [13:11:54] _joe_: no, it just yells at me and proceeds. [13:12:01] <_joe_> urbanecm: ok cool [13:12:16] _joe_: it just prints an error during prefetch [13:12:37] <_joe_> jayme: yeah I want to have a way to exclude a node, as I said, I'll look into it [13:13:46] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:13:52] !log uncordoned kubernetes10[27-56] [13:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:57] <_joe_> pdb_query: "Class[Profile::Kubernetes::Mediawiki_runner] and User[mwdeploy]{ensure=present}" [13:13:59] <_joe_> heh [13:14:11] <_joe_> that's not great [13:14:15] seems like no way to exclude a node to me [13:14:20] but i don't know much about puppet :)) [13:14:23] !log urbanecm@deploy2002 urbanecm and sgimeno: Backport for [[gerrit:960074|GrowthExperiments: enable AddLink backend 14th round of wikis (T308139)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:14:24] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:14:25] (03PS1) 10Majavah: cloudlb: remove unused firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/960614 [13:14:29] T308139: Deploy "add 
a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 [13:14:39] sergi0: anyway, available at mwdebug now. but afaik nothing can be tested, right? [13:14:40] !log ran homer "lsw1-*eqiad*" commit - T346714 [13:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:46] T346714: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 [13:14:58] <_joe_> urbanecm: we'll think of a solution; in the meantime proceed please [13:15:05] urbanecm: right, nothing to test [13:15:10] !log urbanecm@deploy2002 urbanecm and sgimeno: Continuing with sync [13:15:15] thanks _joe_, proceeding. [13:15:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "♥" [puppet] - 10https://gerrit.wikimedia.org/r/960612 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [13:15:28] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [13:15:28] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:15:51] (03Merged) 10jenkins-bot: AddImageFeedbackHandler: Add missing parameters [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959993 (https://phabricator.wikimedia.org/T346277) (owner: 10Urbanecm) [13:15:54] (03Merged) 10jenkins-bot: listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm) [13:16:15] (03CR) 10Majavah: [C: 03+2] 
P:openstack::galera: drop nrpe process check [puppet] - 10https://gerrit.wikimedia.org/r/960612 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [13:16:41] (03PS2) 10Urbanecm: Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:16:46] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:17:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1010.eqiad.wmnet [13:18:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:18:37] (03CR) 10Urbanecm: [C: 03+2] Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:19:16] (03Merged) 10jenkins-bot: Enable Parsoid support for Kartographer on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960552 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:19:39] ihurbain: i'll be proceeding with your patch soon. 
[13:19:44] ok :) [13:19:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P52605 and previous config saved to /var/cache/conftool/dbconfig/20230925-131951-ladsgroup.json [13:20:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:21:10] (03CR) 10Filippo Giunchedi: [C: 03+1] cloudlb: remove unused firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/960614 (owner: 10Majavah) [13:21:24] (03CR) 10Majavah: [C: 03+2] cloudlb: remove unused firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/960614 (owner: 10Majavah) [13:21:30] !log jayme@cumin1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=kubernetes,dc=eqiad [13:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:01] !log jayme@cumin1001 conftool action : set/weight=10; selector: service=kubesvc,cluster=kubernetes,dc=codfw [13:22:11] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1009.eqiad.wmnet [13:22:55] (03CR) 10BBlack: [C: 03+1] geo-maps: reorder codfw/eqiad in the default [dns] - 10https://gerrit.wikimedia.org/r/959182 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [13:23:42] (03CR) 10Hashar: "The image is to build Zuul dependencies (which requires python 2.7) which will allow to migrate the contint* servers from Buster to Bullse" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [13:23:45] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - 
https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [13:25:15] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:960074|GrowthExperiments: enable AddLink backend 14th round of wikis (T308139)]] (duration: 23m 28s) [13:25:22] T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 [13:25:24] this is...very slow. 20+ minutes per patch. [13:25:26] sergi0: synced [13:25:36] cool, ty! [13:26:03] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:959987|listTaskCounts: Do not expect tasks key to be present (T347120)]], [[gerrit:959993|AddImageFeedbackHandler: Add missing parameters (T346277)]], [[gerrit:960552|Enable Parsoid support for Kartographer on enwikivoyage (T342871)]] [13:26:07] urbanecm: Is it the prefetch to k8s nodes taking time? [13:26:13] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:26:13] T346277: Addimage feedback API cannot be called successfully - https://phabricator.wikimedia.org/T346277 [13:26:14] T347120: PHP Notice: Undefined index: tasks - https://phabricator.wikimedia.org/T347120 [13:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:27:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343198)', diff saved to https://phabricator.wikimedia.org/P52606 and previous config saved to /var/cache/conftool/dbconfig/20230925-132711-arnaudb.json [13:27:18] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:28:01] !log Restarting CI Jenkins [13:28:03] !log cgoubert@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host rdb1009.eqiad.wmnet [13:28:04] claime: this is the long steps based on transcript: `build-and-push-container-images (duration: 05m 24s)`, `docker pull on k8s nodes (duration: 02m 10s)`, `sync-check-canaries (duration: 01m 18s)`, `sync-prod-k8s (duration: 01m 54s)`, `php-fpm-restarts (duration: 02m 39s)` [13:28:44] (03PS1) 10Peter Fischer: add search update pipeline streams (update + fetch_error) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) [13:29:11] PROBLEM - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:30:11] That's probably the redis reboot as well [13:30:16] (the netbox alert) [13:30:59] restarted the service, looks good [13:32:01] let's see how second round of scap will look like. but we won't have time for a third one if it takes 20 mins again. 
[13:32:31] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: session-c107960.scope,session-c107961.scope,session-c107962.scope,session-c107963.scope,session-c107964.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:59] The build and push shouldn't be that long if the number of changed layers is low, but it is definitely a pain point we'll have to address [13:33:35] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [13:34:12] claime: i'm kind of worried what would happen once a patch needs a quick revert because it takes our site down / breaks editing / whatever. 
[13:34:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P52607 and previous config saved to /var/cache/conftool/dbconfig/20230925-133457-ladsgroup.json [13:35:09] urbanecm: We always have the option of using a helm rollback to the former image for k8s I think [13:35:17] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,name=kubernetes.* [13:35:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [13:36:12] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,name=kubernetes.* [13:36:13] !log jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.codfw.wmnet [13:36:21] (03CR) 10DCausse: add search update pipeline streams (update + fetch_error) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [13:38:33] !log urbanecm@deploy2002 urbanecm and ihurbain: Backport for [[gerrit:959987|listTaskCounts: Do not expect tasks key to be present (T347120)]], [[gerrit:959993|AddImageFeedbackHandler: Add missing parameters (T346277)]], [[gerrit:960552|Enable Parsoid support for Kartographer on enwikivoyage (T342871)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.w [13:38:33] mnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:38:43] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:38:45] T346277: Addimage feedback API cannot be called successfully - https://phabricator.wikimedia.org/T346277 [13:38:46] T347120: PHP Notice: Undefined index: tasks - https://phabricator.wikimedia.org/T347120 [13:38:47] ihurbain: hi, can you test at mwdebug please? 
[13:38:52] yup, doing that [13:38:54] ty [13:39:15] RECOVERY - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:41:10] @ubanecm I'll move my patch to a later window [13:41:27] HouseOfM: yes, that seems like a reasonable decision. thanks. we won't have time for another scap sync, unfortunately. [13:41:41] (03PS1) 10Giuseppe Lavagetto: scap::dsh: temporarily exclude kubernetes2010 [puppet] - 10https://gerrit.wikimedia.org/r/960621 (https://phabricator.wikimedia.org/T347267) [13:41:59] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:42:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet [13:42:15] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [13:42:15] <_joe_> urbanecm: if you have time for a last patch [13:42:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P52610 and previous config saved to /var/cache/conftool/dbconfig/20230925-134217-arnaudb.json [13:42:27] <_joe_> I would ask you to wait a minute or two [13:42:51] _joe_: i'm mid-scap run. do you want me to abort it? 
or just complete as-is? [13:42:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap::dsh: temporarily exclude kubernetes2010 [puppet] - 10https://gerrit.wikimedia.org/r/960621 (https://phabricator.wikimedia.org/T347267) (owner: 10Giuseppe Lavagetto) [13:43:05] <_joe_> urbanecm: ccomplete as-is [13:43:11] ack [13:43:13] (03PS1) 10David Caro: wmcs: disable pages from nagios/icinga [puppet] - 10https://gerrit.wikimedia.org/r/960622 [13:43:37] urbanecm: we're good on mwdebug [13:43:41] great, proceeding. [13:43:42] !log urbanecm@deploy2002 urbanecm and ihurbain: Continuing with sync [13:44:21] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:48:48] (03CR) 10Majavah: [C: 04-1] wmcs: disable pages from nagios/icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960622 (owner: 10David Caro) [13:49:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:36] <_joe_> urbanecm: ah I got why things were so slow - you're the unlucky one who synced things to the new k8s nodes :) [13:49:47] but why is it slow twice in a row? [13:49:53] <_joe_> but also now I excluded 2010 [13:49:57] <_joe_> yeah that I'm not sure about [13:50:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T344589)', diff saved to https://phabricator.wikimedia.org/P52611 and previous config saved to /var/cache/conftool/dbconfig/20230925-135004-ladsgroup.json [13:50:12] heh :). thanks for excluding 2010 though. 
[13:51:18] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet [13:52:32] 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (10kamila) 05Open→03Resolved [13:52:35] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) [13:53:07] 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) [13:53:26] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10Jhancock.wm) @cmooney I haven't received it yet. I checked with the dock to make sure it hasn't arrived and we weren't notified but no luck. Is there a tracking number for the package? [13:53:47] (03PS3) 10JMeybohm: Update chromium-render to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) [13:53:58] 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) 05Open→03Resolved [13:54:00] (03CR) 10CI reject: [V: 04-1] Update chromium-render to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:54:03] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) [13:54:34] (03CR) 10FNegri: "I'm not clear if this patch will prevent us from being paged if a physical host goes down. 
Is alertmanager also sending a page, or should " [puppet] - 10https://gerrit.wikimedia.org/r/960622 (owner: 10David Caro) [13:54:35] (KubernetesAPILatency) resolved: (21) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:03] _joe_: it's been slow every time I've deployed since the switchover, which makes me doubt that theory [13:55:07] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:55:16] (03PS2) 10JMeybohm: Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033) [13:55:22] like, 20 minutes per patch slow taavi? [13:55:27] yes [13:55:37] <_joe_> taavi: the other option is that deploy2002 has a terrible disk [13:55:49] do we have a task about the post-switchover slowness? if not, i can fill one. [13:56:17] (03CR) 10Elukey: Avoid pages for ores.discovery.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:56:21] it was also that slow last time I deployed, see https://sal.toolforge.org/production?q=959304 [13:56:29] (03PS2) 10Elukey: icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) [13:56:31] (03PS2) 10Elukey: Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) [13:56:36] <_joe_> did you report this to release engineering? 
[13:56:38] (03CR) 10Elukey: icinga/nagios: remove check_ores* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:57:00] <_joe_> I wouldn't assume the problem is the switchover tbh [13:57:10] <_joe_> but full logs would help us understand what got slower [13:57:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P52612 and previous config saved to /var/cache/conftool/dbconfig/20230925-135724-arnaudb.json [13:57:28] https://wikimedia.slack.com/archives/C05H0JYT85V/p1695238823352749 seems like it might be relevant here? [13:57:48] and now i got `skipping missing values file matching "values-main.yaml"`. [13:58:14] and `Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition` [13:58:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet [13:58:30] and `13:56:07 Rolling back to prior state...` :-/ [13:59:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 59278 [13:59:51] (03PS2) 10JMeybohm: Update mathoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953261 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [14:00:04] PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100% [14:00:07] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:32] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet [14:01:08] 10SRE, 10Traffic: Implement VTC tests 
for PURGE requests - https://phabricator.wikimedia.org/T347297 (10Fabfur) [14:02:03] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add SLO definition for the ORES Legacy service (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955355 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [14:02:13] (03CR) 10Elukey: [C: 03+2] alertmanager: create ml team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [14:02:23] (03CR) 10Elukey: [V: 03+2 C: 03+2] Lift Wing: add latency/availability SLO dashboards (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [14:02:45] (03PS6) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [14:02:47] (03PS1) 10Andrew Bogott: cloudservices1006: remove old listen-on address [puppet] - 10https://gerrit.wikimedia.org/r/960624 (https://phabricator.wikimedia.org/T346385) [14:03:20] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1006: remove old listen-on address [puppet] - 10https://gerrit.wikimedia.org/r/960624 (https://phabricator.wikimedia.org/T346385) (owner: 10Andrew Bogott) [14:03:44] RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:04:07] (JobUnavailable) firing: (5) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:04:19] <_joe_> urbanecm: do you have a paste of one of your latest scaps? 
[14:04:38] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:959987|listTaskCounts: Do not expect tasks key to be present (T347120)]], [[gerrit:959993|AddImageFeedbackHandler: Add missing parameters (T346277)]], [[gerrit:960552|Enable Parsoid support for Kartographer on enwikivoyage (T342871)]] (duration: 38m 35s) [14:04:50] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [14:04:50] T346277: Addimage feedback API cannot be called successfully - https://phabricator.wikimedia.org/T346277 [14:04:51] T347120: PHP Notice: Undefined index: tasks - https://phabricator.wikimedia.org/T347120 [14:07:00] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet [14:07:05] urbanecm: does it say which environment? [14:07:37] (03PS1) 10JMeybohm: Update machinetranslation to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/960625 (https://phabricator.wikimedia.org/T300033) [14:08:44] (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:02] PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:13] <_joe_> we're still running the old version of the code in k8s btw [14:10:25] <_joe_> jayme: you should run a scap --k8s-only tbh [14:12:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T343198)', diff saved to https://phabricator.wikimedia.org/P52613 and previous config saved to /var/cache/conftool/dbconfig/20230925-141230-arnaudb.json [14:12:33] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [14:12:38] T343198: Add pl_target_id column to pagelinks in production - 
https://phabricator.wikimedia.org/T343198 [14:12:42] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10Jclark-ctr) @taavi Relovated to rack E 4. updated netbox with location. switch port is #9 [14:12:46] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [14:12:49] (03CR) 10Muehlenhoff: admin: Create analytics-wmde system user and airflow admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:12:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T343198)', diff saved to https://phabricator.wikimedia.org/P52614 and previous config saved to /var/cache/conftool/dbconfig/20230925-141252-arnaudb.json [14:13:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance [14:13:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance [14:13:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P52615 and previous config saved to /var/cache/conftool/dbconfig/20230925-141313-ladsgroup.json [14:13:19] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10Jclark-ctr) [14:13:30] RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [14:14:08] (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:07] (03CR) 10Jforrester: [C: 03+2] Wikifunctions: Update evaluator image to 2023-09-19-183305 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959037 (owner: 10Jforrester) [14:17:12] (03Merged) 10jenkins-bot: Wikifunctions: Update evaluator image to 2023-09-19-183305 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959037 (owner: 10Jforrester) [14:17:16] (03CR) 10Hnowlan: [C: 03+2] mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/960596 (owner: 10Hnowlan) [14:18:02] (03Merged) 10jenkins-bot: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/960596 (owner: 10Hnowlan) [14:18:38] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:18:41] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:18:44] (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:07] (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:12] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:19:14] (03PS1) 10JMeybohm: Remove all quota from mw namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 [14:19:15] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:20:02] <_joe_> taavi: so it seems we're indeed rebuilding from scratch for every release, not sure why. 
[14:20:19] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:20:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10Jclark-ctr) @BTullis i am on site today. otherwise we can do it tomorrow [14:20:50] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:21:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove all quota from mw namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm) [14:21:12] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:21:17] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10Jclark-ctr) a:05Jclark-ctr→03taavi [14:21:59] (03CR) 10JMeybohm: [C: 03+2] Remove all quota from mw namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm) [14:22:07] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:22:11] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:22:53] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove all quota from mw namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm) [14:22:59] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:23:48] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove all quota from mw namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960626 (owner: 10JMeybohm) [14:23:52] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:24] !log herron@cumin1001 
START - Cookbook sre.hosts.decommission for hosts dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet [14:24:51] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:24:53] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:24:59] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:25:06] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:25:11] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:25:21] (PS2) Jforrester: Re-apply "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) [14:25:54] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43532/console" [puppet] - https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: Elukey) [14:26:10] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:27:24] (CR) Jforrester: [C: +2] Re-apply "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: Jforrester) [14:28:01] (CR) Ladsgroup: [C: +1] Avoid pages for ores.discovery.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: Elukey) [14:28:38] (Merged) jenkins-bot: Re-apply "Fix wikifunctions orchestrator not using the service mesh" [deployment-charts] - https://gerrit.wikimedia.org/r/953212 (https://phabricator.wikimedia.org/T344998) (owner: Jforrester) [14:28:41] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:29:08] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:29:45] !log herron@cumin1001 START - Cookbook sre.dns.netbox [14:30:36] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:31:03] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:31:10] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:31:28] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:31:50] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - herron@cumin1001" [14:31:50] (03CR) 10Klausman: [C: 03+1] icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:32:07] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:32:23] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[14:33:08] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:33:34] !log jayme@deploy2002 Started scap: (no justification provided) [14:34:10] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:34:17] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:34:27] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [14:35:09] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P52618 and previous config saved to /var/cache/conftool/dbconfig/20230925-143523-ladsgroup.json [14:36:12] (03PS1) 10Jforrester: Revert "Re-apply "Fix wikifunctions orchestrator not using the service mesh"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959997 [14:36:16] (03CR) 10Jforrester: [C: 03+2] Revert "Re-apply "Fix wikifunctions orchestrator not using the service mesh"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959997 (owner: 10Jforrester) [14:36:44] !log jayme@deploy2002 Finished scap: (no justification provided) (duration: 03m 09s) [14:37:00] (03Merged) 10jenkins-bot: Revert "Re-apply "Fix wikifunctions orchestrator not using the service mesh"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959997 (owner: 10Jforrester) [14:37:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10CDanis) 05Stalled→03In progress a:05NHillard-WMF→03CDanis [14:37:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
very nice" [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:37:57] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:38:26] (03CR) 10Elukey: [V: 03+1 C: 03+2] icinga/nagios: remove check_ores* [puppet] - 10https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:38:47] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:38:55] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:39:38] (03CR) 10Klausman: [C: 03+1] Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:39:43] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:39:57] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:40:08] (03PS3) 10Elukey: Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) [14:40:16] (03CR) 10Klausman: [C: 03+1] ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - 10https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [14:40:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:40:48] (03CR) 10Elukey: [C: 03+2] Avoid pages for ores.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/960569 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:41:18] (03PS1) 10Jclark-ctr: new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 
(https://phabricator.wikimedia.org/T342660) [14:41:49] (03CR) 10CI reject: [V: 04-1] new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr) [14:42:01] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add jaeger query/collector alerts [alerts] - 10https://gerrit.wikimedia.org/r/959950 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [14:42:40] (03CR) 10Filippo Giunchedi: [C: 03+2] otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [14:42:48] (03CR) 10Elukey: [C: 03+1] APIGW: add entry for multilingual readability LW isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [14:43:36] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - herron@cumin1001" [14:43:36] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:43:37] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dispatch-be2001.codfw.wmnet,dispatch-be1001.eqiad.wmnet [14:43:40] (03PS38) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [14:43:42] (03PS1) 10AOkoth: gitlab: change service_name on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/960632 (https://phabricator.wikimedia.org/T345590) [14:44:35] (03PS2) 10Herron: remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) [14:44:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet 
[14:45:14] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T347257 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm caused by issue in T347267. will resolve there. [14:45:33] (03PS2) 10Jclark-ctr: new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) [14:45:34] (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:45:49] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [14:45:56] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [14:45:58] (03CR) 10CI reject: [V: 04-1] new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr) [14:46:07] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [14:46:12] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [14:47:44] (03PS3) 10Jclark-ctr: new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) [14:47:57] (03CR) 10Hnowlan: [C: 03+1] "One nit, otherwise lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [14:48:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10Jhancock.wm) server is not getting to POST. starting troubleshooting. 
[14:48:09] (03CR) 10CI reject: [V: 04-1] new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr) [14:48:11] (03CR) 10Herron: [C: 03+2] remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [14:48:30] (03PS1) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) [14:48:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10colewhite) >>! In T346796#9192790, @AKhatun_WMF wrote: > I am getting this error when I kinit > `kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while... [14:48:53] (03PS2) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) [14:49:04] (03PS2) 10AOkoth: gitlab: change service_name on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/960632 (https://phabricator.wikimedia.org/T345590) [14:49:09] (03PS4) 10Jclark-ctr: new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) [14:49:16] (03PS2) 10Klausman: APIGW: add entry for multilingual readability LW isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) [14:49:35] (03CR) 10CI reject: [V: 04-1] new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr) [14:49:56] (03CR) 10Klausman: APIGW: add entry for multilingual readability LW isvc (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [14:50:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to 
https://phabricator.wikimedia.org/P52619 and previous config saved to /var/cache/conftool/dbconfig/20230925-145029-ladsgroup.json [14:52:12] (03CR) 10CI reject: [V: 04-1] gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [14:53:25] (03PS5) 10Jclark-ctr: new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) [14:53:26] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:53:28] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet [14:54:06] (03CR) 10Jclark-ctr: [C: 03+2] new node T342660 [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr) [14:54:32] (03CR) 10Herron: [C: 03+2] dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [14:54:48] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:41] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: delay restore timer 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/959683 (owner: 10Jelto) [14:56:44] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:06] !log installing python3.7 security updates [14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:30] (03PS6) 10Andrea Denisse: prometheus: Prevent Prometheus from scraping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) [14:58:08] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:59:37] (03CR) 10Elukey: new node T342660 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960631 (https://phabricator.wikimedia.org/T342660) (owner: 10Jclark-ctr) [15:00:15] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:02:28] (03CR) 10David Caro: wmcs: disable pages from nagios/icinga (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960622 (owner: 10David Caro) [15:04:02] (03CR) 10JHathaway: [C: 03+2] puppetserver: Serve the full cert chain via jetty [puppet] - 10https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [15:04:14] (03CR) 10JHathaway: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/959238 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [15:04:39] (03CR) 10JHathaway: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/959241 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [15:04:55] (03CR) 10JHathaway: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/959234 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [15:05:12] (03CR) 10JHathaway: "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [15:05:14] (03CR) 10JHathaway: [C: 03+2] puppetdb prometheus exporter: 
in a container listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/959232 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [15:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P52620 and previous config saved to /var/cache/conftool/dbconfig/20230925-150536-ladsgroup.json [15:06:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:06:08] (03CR) 10JHathaway: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [15:06:16] (03CR) 10AikoChou: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [15:06:32] (03CR) 10JHathaway: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/959224 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [15:06:49] (03CR) 10JHathaway: [C: 03+2] "thanks for the reviews" [puppet] - 10https://gerrit.wikimedia.org/r/959227 (owner: 10JHathaway) [15:07:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 2.464 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:08:10] (03PS2) 10Peter Fischer: add search update pipeline streams (update + fetch_error) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) [15:08:57] (03CR) 10Peter Fischer: add search update pipeline streams (update + fetch_error) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [15:10:16] (03PS1) 10Muehlenhoff: standard_packages: Remove Python 3.7 packages after buster->bullseye update [puppet] - 
10https://gerrit.wikimedia.org/r/960634 [15:10:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:11:58] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [15:12:34] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) [15:13:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10Jhancock.wm) @JMeybohm looks like the system board has died. Server powers on, but even with minimum hardware configuration the server will not actually boot up. Idrac is also inaccessible. This... 
[15:14:09] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10Jhancock.wm) a:03Jhancock.wm [15:14:32] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [15:15:13] (03CR) 10Filippo Giunchedi: "The patch LGTM, I thought about it a little bit and it'll work as-is, however:" [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [15:15:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:16:12] (03PS1) 10FNegri: Add more details to Readme [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/960637 [15:16:54] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1007 - taavi@cumin1001" [15:17:07] (03CR) 10JHathaway: puppetdb: preseed to avoid creating database users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [15:17:43] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1007 - taavi@cumin1001" [15:17:43] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:19:00] !log alert[12]001 -- apt remove docker.io T344937 [15:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:07] T344937: Decom dispatch infrastructure - https://phabricator.wikimedia.org/T344937 [15:20:22] (03PS1) 10Andrea Denisse: superset: Disable Prometheus scraping for superset metrics [puppet] - 10https://gerrit.wikimedia.org/r/960638 
(https://phabricator.wikimedia.org/T346656) [15:20:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T344589)', diff saved to https://phabricator.wikimedia.org/P52621 and previous config saved to /var/cache/conftool/dbconfig/20230925-152043-ladsgroup.json [15:20:44] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:21:28] !log alert[12]001 -- rm /etc/apache2/sites-available/50-dispatch-wikimedia-org.conf && apachectl graceful T344937 [15:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:36] (03CR) 10FNegri: "For some reason I don't have +2 rights on this repo. I have built the package on mcrouter.packaging.eqiad1.wikimedia.cloud, can you please" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri) [15:21:54] !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1007 [15:22:34] !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1007 [15:23:05] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new records for cloudcontrol1007 - cmooney@cumin1001" [15:23:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new records for cloudcontrol1007 - cmooney@cumin1001" [15:23:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:54] (03PS1) 10Muehlenhoff: puppetdb: Remove obsolete Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/960641 [15:24:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove traceback-roots [puppet] - 10https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [15:24:30] (03PS2) 10Muehlenhoff: Remove traceback-roots 
[puppet] - 10https://gerrit.wikimedia.org/r/960548 (https://phabricator.wikimedia.org/T276465) [15:26:14] (03PS3) 10Cwhite: prometheus: add option to configure probe-specific params [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) [15:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:26:42] (03PS1) 10Majavah: site: re-assign role for cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) [15:27:23] (03PS4) 10Cwhite: prometheus: add option to configure probe-specific params [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) [15:27:36] (03PS2) 10Majavah: site: re-assign role for cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) [15:28:39] (03PS3) 10Majavah: site: re-assign role for cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) [15:28:45] (03CR) 10Arturo Borrero Gonzalez: site: re-assign role for cloudcontrol1007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) (owner: 10Majavah) [15:28:57] (03CR) 10Majavah: site: re-assign role for cloudcontrol1007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) (owner: 10Majavah) [15:29:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] site: re-assign role for cloudcontrol1007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) (owner: 10Majavah) [15:29:33] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team 
(FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10taavi) [15:29:56] (03CR) 10Elukey: [C: 03+1] standard_packages: Remove Python 3.7 packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/960634 (owner: 10Muehlenhoff) [15:30:05] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1530). [15:30:07] (03CR) 10Filippo Giunchedi: [C: 03+1] standard_packages: Remove Python 3.7 packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/960634 (owner: 10Muehlenhoff) [15:30:36] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add option to configure probe-specific params [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [15:30:51] (03CR) 10Cwhite: [C: 03+2] prometheus: add option to configure probe-specific params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [15:31:30] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10JMeybohm) Thanks! We did not plan to decom immediately, so it would really help us if you could replace the board and we could run the server for a bit longer. 
[15:33:12] 10SRE, 10Traffic, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BCornwall) 05Resolved→03In progress [15:33:18] (03PS1) 10Cwhite: wmflib: fix typo in probe type [puppet] - 10https://gerrit.wikimedia.org/r/959985 (https://phabricator.wikimedia.org/T346893) [15:33:24] 10SRE, 10Traffic, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BCornwall) @Vgutierrez Thanks for your patch fixing thread_pool_max; IIRC @bblack had advised the flat 12000 max threads due to the arbitrary nature of the processorcount. Is this patch to... [15:33:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10VRiley-WMF) wdqs1020 - E 2. U 18 CableID 2303045000257 Port 38 wdqs1021 - F 2. U 42. CableID 2303045000256 Port 20 wdqs1022 - D 2. U 13. CableID 230304500202 Port 25 wdqs1023... [15:33:53] (03Abandoned) 10Cwhite: wmflib: fix typo in probe type [puppet] - 10https://gerrit.wikimedia.org/r/959985 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [15:34:21] (03CR) 10Filippo Giunchedi: "LGTM, see my comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/958807 about deploying that patch, once that's done we can me" [puppet] - 10https://gerrit.wikimedia.org/r/960638 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [15:37:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance [15:37:33] (03CR) 10Elukey: [C: 03+1] ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - 10https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [15:37:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1032.eqiad.wmnet with reason: Maintenance [15:39:02] PROBLEM - 
mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:39:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:40:00] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:21] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - 10https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [15:42:18] (03Merged) 10jenkins-bot: ml-services: remove old eswikiquote and eswikibooks models [deployment-charts] - 10https://gerrit.wikimedia.org/r/960234 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [15:43:19] Amir1: elukey: hmm, https://petscan.wmflabs.org/ seems to expect that ores can return something in a javascript callback format instead of being JSON. is that supposed to be supported? [15:43:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance [15:44:14] (03PS5) 10C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [15:44:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance [15:45:08] taavi: never heard about it, not even from logs.. ores legacy definitely doesn't support a js callback, didn't even know that ores supported that. Do you have a moment to open a task with the query that it makes? 
[15:47:56] (03CR) 10Andrew Bogott: [C: 03+2] Package for Debian Bookworm [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: 10FNegri) [15:48:44] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance [15:50:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance [15:50:54] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:18] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:54:05] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:55:04] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[15:55:06] (03PS1) 10CDanis: nathillard analytics-privatedata-users access [puppet] - 10https://gerrit.wikimedia.org/r/960647 (https://phabricator.wikimedia.org/T342588) [15:55:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:55:59] (03CR) 10CDanis: [C: 03+2] nathillard analytics-privatedata-users access [puppet] - 10https://gerrit.wikimedia.org/r/960647 (https://phabricator.wikimedia.org/T342588) (owner: 10CDanis) [15:56:40] 10SRE, 10Traffic, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BBlack) To clarify and expand on my position about this thread count parameter (which is really just a side-issue related to this ticket, which is fundamentally complete): 1. Varnish's thr... [15:56:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:57:01] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:57:32] (03PS3) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) [15:57:46] (03PS4) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) [15:58:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10CDanis) 05In progress→03Resolved Hi Issac, sorry this slipped through SRE's process as well -- this should have been taken care of last week.... 
[15:58:57] /26 [16:00:32] (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:01:03] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:09] (03CR) 10CI reject: [V: 04-1] gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [16:01:28] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:03:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:04:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:04:57] (03CR) 10Peter Fischer: "Thank you for adapting the chart! Just noticed a config-naming-issue in the consumer (fetch failure -> fetch error)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [16:06:03] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:06:55] PROBLEM - Host db2109 #page is DOWN: PING CRITICAL - Packet loss = 100% [16:07:45] hmm, I will depool [16:07:51] that's non-paging I guess? 
[16:07:55] paging [16:08:26] well, I'm basing that on the lack of "# page" that's usually present (spacing that out because I think it triggers) [16:08:47] and that I got no page, even though I'm in business hours. Even when I open splunk, nothing. [16:08:47] It's between the hostname and "is DOWN" [16:08:49] It's there [16:08:57] oh it is there, so the rest of my questions remain [16:09:05] !log sukhe@cumin2002 dbctl commit (dc=all): 'Depool db2109', diff saved to https://phabricator.wikimedia.org/P52622 and previous config saved to /var/cache/conftool/dbconfig/20230925-160904-sukhe.json [16:09:17] It is not in sirenbot's incidents though [16:09:24] depooled [16:09:33] on active, no triggered, no acked, in the splunk UI on my phone [16:09:36] s/on/no/ [16:09:55] you wouldn't have gotten paged because you aren't on call, but it does also seem like the page never got to VO in the first place [16:10:18] oh, no I do see it under triggered [16:10:33] I have one showing there now, too [16:10:33] but a few minutes delayed getting there, which isn't good [16:10:37] but I didn't until just now :) [16:10:43] I just got the push notification [16:10:43] weird! [16:10:45] and I am oncall [16:10:48] thanks sukhe [16:11:14] Just got paged too. [16:11:22] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [16:11:26] so it's just some delay issue [16:11:38] at least a few minutes [16:11:41] icinga --> victorops uses SMTP, iirc [16:11:57] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[16:12:00] there are still mw errors but it is the dumping process [16:12:06] not end-user errors [16:12:31] or some other process in mwmaint2002 [16:14:46] (03Abandoned) 10Ilias Sarantopoulos: api-gateway: change liftwing hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/940945 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [16:15:32] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:16:29] this is a performance alert AFAICT, why is it alerting in here? [16:17:25] It is "MWScript.php migrateLinksTable --wiki=ruwikinews --table pagelinks --batch-size 10000 --sleep 0.1" [16:17:56] hopefully someone can search the process on mwmaint2002 on kill it [16:18:02] *and kill it [16:18:15] so it doesn't send more errors to the logs [16:18:38] elukey: https://phabricator.wikimedia.org/T347317 [16:18:55] am I needed? [16:19:02] no, it is a mw job [16:19:17] that hasn't updated after the depool [16:19:38] db2109 looks to have been just ... powered off?? [16:19:46] there's nothing in the SEL [16:19:49] and powerstatus is OFF [16:19:52] I'll create a task for it [16:19:59] cdanis: maybe a loose cable ( [16:20:01] ? [16:20:07] marostegui: two loose cables? [16:20:14] I thought we had redundant PSUs [16:20:14] marostegui: I mean that it could be handled tomorrow, it was not an emergency [16:20:25] after the depool [16:20:59] thanks jynus [16:21:03] I just created the task [16:21:20] https://phabricator.wikimedia.org/T347318 [16:22:59] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10CDanis) Nothing recent in the SAL. racadm serveraction powerstatus reports OFF. I guess someone asked the host to shut down via the management interface? 
[16:23:02] thanks marostegui [16:27:34] RECOVERY - Host kubernetes2010 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [16:28:46] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:32:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:34:26] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:34:30] <_joe_> uh [16:34:36] <_joe_> is that k8s2010? [16:35:08] seems like it, there is a recovery at least [16:41:34] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:42:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10Jhancock.wm) got it replaced. updated the asset tag, idrac IP, bios/idrac firmware, and adjusted some bios settings. the idrac and network addresses are pinging, and there are no alerts that I can...
[16:47:20] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:48:46] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:51:53] !log uncordon kubernetes2010.codfw.wmnet - T347267 [16:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:00] T347267: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 [16:53:03] (03PS1) 10JMeybohm: Revert "scap::dsh: temporarily exclude kubernetes2010" [puppet] - 10https://gerrit.wikimedia.org/r/960003 [16:53:18] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10JMeybohm) Nice, thanks for handling this so quickly! 
Nothing more to do from your end [16:54:06] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10Jhancock.wm) 05Open→03Resolved [16:54:19] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes2010.codfw.wmnet [16:54:44] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:55:21] (03CR) 10JMeybohm: [C: 03+2] Revert "scap::dsh: temporarily exclude kubernetes2010" [puppet] - 10https://gerrit.wikimedia.org/r/960003 (owner: 10JMeybohm) [16:55:44] (03PS2) 10JMeybohm: Revert "scap::dsh: temporarily exclude kubernetes2010" [puppet] - 10https://gerrit.wikimedia.org/r/960003 (https://phabricator.wikimedia.org/T347267) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1700) [17:00:05] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T1700).
[17:04:16] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:42] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:07:26] (03PS5) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) [17:07:36] (03PS6) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) [17:12:30] (03CR) 10Abijeet Patro: [V: 03+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/960582 (owner: 10L10n-bot) [17:16:27] 10SRE, 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10Jhancock.wm) a:03Jhancock.wm 2023-09-25 16:05:43 SYS1001 System is turning off. 2023-09-25 16:05:43 SYS1003 System CPU Resetting. 2023-08-22 02:14:32 SYS1003 System CPU Resetting. There's couldn't find... 
[17:24:07] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:39:04] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:47] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10CDanis) a:05Eevans→03darthmon_wmde [17:43:12] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:24] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [17:44:48] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:46:10] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:47:51] (03CR) 10Subramanya Sastry: 
"This feels stalled for a while now ... anything needed on my end to move this forward? We are only using local dbs and none of the product" [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [17:49:50] (03PS1) 10Ebernhardson: k8s config: Provide zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [17:50:06] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:50:15] (03CR) 10CI reject: [V: 04-1] k8s config: Provide zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [17:51:28] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:53:59] (03PS2) 10Ebernhardson: k8s config: Provide zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [17:54:23] (03CR) 10CI reject: [V: 04-1] k8s config: Provide zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [17:58:27] (03PS5) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) [18:09:05] (03PS3) 10Ebernhardson: k8s config: Provide zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [18:11:24] (03PS5) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [18:17:11] (03PS1) 10Bking: wdqs: re-enable LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/960664 
(https://phabricator.wikimedia.org/T347284) [18:23:44] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:27:44] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [18:31:02] (03CR) 10Ebernhardson: Pull some flink config down into the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [18:33:09] (03Abandoned) 10Bking: wdqs: re-enable LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/960664 (https://phabricator.wikimedia.org/T347284) (owner: 10Bking) [18:37:49] (03PS7) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [18:38:37] (03PS1) 10Bking: trafficserver: use wdqs1015 as LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) [18:39:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) (owner: 10Bking) [18:41:50] (03PS2) 10Bking: trafficserver: use wdqs1015 as LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) [18:44:57] (03PS1) 10Marostegui: db2109.: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/960688 (https://phabricator.wikimedia.org/T347318) [18:45:34] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed [18:45:40] (03CR) 10BCornwall: [C: 03+2] package_builder: add piuparts package [puppet] - 10https://gerrit.wikimedia.org/r/956968 (owner: 10BCornwall) [18:45:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2109.codfw.wmnet with reason: Host crashed [18:46:01] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=048e37aa-4014-4b71-85fd-37c023deeb00) set by marostegui@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Ho... [18:46:07] (03CR) 10Marostegui: [C: 03+2] db2109.: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/960688 (https://phabricator.wikimedia.org/T347318) (owner: 10Marostegui) [18:47:26] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10Marostegui) Downtimed it for a week [18:48:58] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:22] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:52:07] (03PS8) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [18:53:00] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is 
CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:55:07] (03PS4) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [18:55:10] (03PS3) 10Ebernhardson: flink-app: Provide kafka hosts as properties file [deployment-charts] - 10https://gerrit.wikimedia.org/r/959066 [18:55:52] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:57:56] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:59:02] (03PS9) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [18:59:22] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:04:57] (03CR) 10Ryan Kemper: [C: 03+1] trafficserver: use wdqs1015 as LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) (owner: 10Bking) [19:05:36] (03CR) 10Ryan Kemper: [C: 03+2] trafficserver: use wdqs1015 as LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/960687 (https://phabricator.wikimedia.org/T347284) (owner: 10Bking) [19:07:02] (03CR) 10AOkoth: [C: 03+1] gitlab: delay restore timer 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/959683 (owner: 10Jelto) [19:07:24] (03CR) 10AOkoth: [C: 
03+1] gitlab: remove deprecated grafana feature [puppet] - 10https://gerrit.wikimedia.org/r/959689 (owner: 10Jelto) [19:07:46] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:08:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:09:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:36] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:11:41] (03PS10) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:13:26] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking) [19:13:28] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:13:37] (03PS11) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:14:24] (03CR) 10Majavah: "This is causing Puppet to fail on some Cloud VPS hosts:" [puppet] - 
10https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [19:14:50] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:16:22] (03PS12) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:16:39] (03PS1) 10RLazarus: httpbb: Switch to a different entity for testwikidata [puppet] - 10https://gerrit.wikimedia.org/r/960693 [19:19:29] (03PS13) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:20:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [19:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:29:49] (03PS14) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:30:14] (03CR) 10CI reject: [V: 04-1] designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: 10Andrew Bogott) [19:32:42] (03PS15) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private 
IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:35:25] (03CR) 10BCornwall: [V: 03+1 C: 03+2] mtail: Record bad requests for varnish SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/953725 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [19:37:15] (03PS16) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:39:43] (03PS17) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:42:23] (03PS18) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [19:50:46] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:52:14] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:55:14] (03CR) 10AOkoth: [C: 03+2] wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [19:55:48] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:56:16] (03PS3) 10JHathaway: 
puppetserver: add comment on avoiding perma-diff for /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970) [19:57:43] (03CR) 10JHathaway: [C: 03+2] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/959235 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [19:59:24] (03PS3) 10DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T2000). [20:00:06] danisztls and houseofm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:00:18] o/ [20:01:03] hi i can deploy [20:02:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:03:40] (03Merged) 10jenkins-bot: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959826 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:03:57] !log cjming@deploy2002 Started scap: Backport for [[gerrit:959826|Deploy Reader Demographics 2 pilot survey (T345951)]] [20:04:05] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:06:46] also here sorry but it in wrong deploy window it seems? 
[20:07:15] ^ cjming i've added now [20:07:30] cjming: this change will be difficult to test as it only increases coverage [20:07:30] hi Jdlrobson :) sounds good [20:08:16] danisztls: should i go ahead and sync? or do you want to try to test? [20:09:57] cjming: yep, go ahead [20:12:50] Hi cjming I added a simple config patch too (if you have time after these deployments) :) [20:12:55] (03CR) 10Subramanya Sastry: [C: 03+1] Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [20:13:20] Superpes: sure np [20:14:07] Thanks ;) [20:15:54] !log cjming@deploy2002 cjming and dani: Backport for [[gerrit:959826|Deploy Reader Demographics 2 pilot survey (T345951)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:16:01] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:16:03] !log cjming@deploy2002 cjming and dani: Continuing with sync [20:19:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:19:54] houseofm: are you around for your patch? [20:21:44] cjming: thanks!
[20:22:45] danisztls: :) should be live shortly [20:24:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:15] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:959826|Deploy Reader Demographics 2 pilot survey (T345951)]] (duration: 21m 18s) [20:25:29] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:25:41] Jdlrobson: i'll do yours next [20:25:46] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [20:25:48] cool! 
[20:26:53] (03CR) 10Clare Ming: [C: 03+2] Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: 10Jdlrobson) [20:27:14] (03PS2) 10JHathaway: prometheus-postgres-exporter: install configs before service [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) [20:27:22] (03PS4) 10Clare Ming: Fix white background for Wikibooks wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: 10Pikne) [20:28:01] (03CR) 10JHathaway: prometheus-postgres-exporter: install configs before service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [20:28:13] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [20:30:40] (03PS3) 10Clare Ming: Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: 10Jdlrobson) [20:32:45] (03CR) 10Clare Ming: [C: 03+2] Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: 10Jdlrobson) [20:33:58] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:34:19] Jdlrobson: i'm trying to manually +2 your patches (rebase in between) so i can scap backport them together [20:34:59] (03CR) 10Clare Ming: [C: 03+2] Fix white background for Wikibooks wordmarks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: 10Pikne) [20:35:26] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:35:43] (03Merged) 10jenkins-bot: Fix white background for Wikibooks wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: 10Pikne) [20:35:46] (03Merged) 10jenkins-bot: Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) (owner: 10Jdlrobson) [20:36:33] (03PS7) 10Clare Ming: Icons for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) (owner: 10Jdlrobson) [20:37:46] (03CR) 10Clare Ming: [C: 03+2] Icons for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) (owner: 10Jdlrobson) [20:38:03] cjming: sounds good [20:38:14] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:38:27] (03Merged) 10jenkins-bot: Icons for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) (owner: 10Jdlrobson) [20:39:01] !log cjming@deploy2002 Started scap: Backport for [[gerrit:959872|Provide wordmarks/taglines for Wikibooks projects (T341251)]], [[gerrit:957908|Fix white background for Wikibooks wordmarks (T341251)]], [[gerrit:956502|Icons for special projects (T341242)]] [20:39:11] T341242: Design: 
Get icons for Wikimedia special wikis (including chapters) - https://phabricator.wikimedia.org/T341242 [20:39:11] T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251 [20:39:26] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:39:38] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:40:02] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking) [20:50:21] (03CR) 10Fabfur: vanish: allow PURGE requests only from dedicated socket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: 10Fabfur) [20:51:06] (03PS9) 10Fabfur: vanish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [20:51:09] !log cjming@deploy2002 pikne and cjming and jdlrobson: Backport for [[gerrit:959872|Provide wordmarks/taglines for Wikibooks projects (T341251)]], [[gerrit:957908|Fix white background for Wikibooks wordmarks (T341251)]], [[gerrit:956502|Icons for special projects (T341242)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:51:14] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:51:19] T341242: Design: Get icons for Wikimedia special wikis (including chapters) - 
https://phabricator.wikimedia.org/T341242 [20:51:19] Jdlrobson: are you able to test? [20:51:19] T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251 [20:51:50] cjming: yep looking now [20:53:18] @cjming LGTM! please sync! [20:53:28] yay - syncing [20:53:34] !log cjming@deploy2002 pikne and cjming and jdlrobson: Continuing with sync [20:54:09] Jdlrobson: i'm assuming i need to purge the files in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/957908 ? [20:58:56] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:59:15] cjming: yep i believe so [20:59:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:00:05] Reedy, sbassett, Maryum, and manfredi: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230925T2100). Please do the needful. 
[21:00:24] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:01:04] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:02:52] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:959872|Provide wordmarks/taglines for Wikibooks projects (T341251)]], [[gerrit:957908|Fix white background for Wikibooks wordmarks (T341251)]], [[gerrit:956502|Icons for special projects (T341242)]] (duration: 23m 50s) [21:03:09] T341242: Design: Get icons for Wikimedia special wikis (including chapters) - https://phabricator.wikimedia.org/T341242 [21:03:10] T341251: Deploy wordmarks/taglines for Wikibooks projects - https://phabricator.wikimedia.org/T341251 [21:03:54] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:04:11] Jdlrobson: ok - should be live - and i just purged the svgs [21:04:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:04:41] Superpes: if you're still around, i'll do yours now [21:04:58] (03PS3) 10Clare Ming: [fiwiki] Add an editautoreviewprotected level protecion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) (owner: 10Superpes15) [21:08:32] houseofm: Superpes: i'll hang out for a few minutes after which i'll close this backport window [21:09:34] (KubernetesAPILatency) resolved: 
(3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:12:22] !log end of UTC late backport window [21:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:39] I'm going to test some scap changes on the deploy server now. [21:17:09] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [21:19:12] (03PS1) 10JHathaway: nginx: mount lib on tmpfs vol in cloud [puppet] - 10https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842) [21:19:21] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wdqs1017-20 - jclark@cumin1001" [21:20:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wdqs1017-20 - jclark@cumin1001" [21:20:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:47] (03CR) 10JHathaway: [C: 03+2] nginx: add toggle for mounting lib on tmpfs vol (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959226 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [21:21:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [21:22:03] !log dancy@deploy2002 Started scap: testing scap mods [21:24:08] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:25:24] (03CR) 10JHathaway: [C: 03+2] nginx: mount lib on tmpfs vol in cloud [puppet] - 10https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842) 
(owner: 10JHathaway) [21:27:13] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [21:29:18] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wdqs1017-20 - jclark@cumin1001" [21:29:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1022.mgmt.eqiad.wmnet with reboot policy FORCED [21:29:52] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1017.mgmt.eqiad.wmnet with reboot policy FORCED [21:29:55] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED [21:29:58] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED [21:30:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wdqs1017-20 - jclark@cumin1001" [21:30:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:30:38] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1018.mgmt.eqiad.wmnet with reboot policy FORCED [21:30:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1019.mgmt.eqiad.wmnet with reboot policy FORCED [21:31:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED [21:31:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED [21:31:31] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [21:32:39] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED [21:32:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED [21:33:08] (03PS19) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [21:35:52] (03PS20) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [21:36:43] !log dancy@deploy2002 Installing scap version "4.62.0" for 598 hosts [21:37:51] !log dancy@deploy2002 Installation of scap version "4.62.0" completed for 598 hosts [21:38:32] !log dancy@deploy2002 Started scap: testing scap mods [21:40:50] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:06] (03PS21) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [21:45:06] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:25] !log dancy@deploy2002 Started scap: testing scap mods [21:46:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1018.mgmt.eqiad.wmnet with reboot policy FORCED [21:46:48] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:46:52] !log dancy@deploy2002 Started scap: final test sync [21:47:18] (03PS22) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 
10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [21:47:28] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) [21:47:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED [21:47:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [21:48:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [21:48:27] (03PS1) 10Ryan Kemper: elastic: don't alert p95 if request volume low [puppet] - 10https://gerrit.wikimedia.org/r/960712 (https://phabricator.wikimedia.org/T347341) [21:48:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [21:49:14] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:49:40] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:50:38] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:51:09] (03PS23) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [21:51:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wdqs1019.mgmt.eqiad.wmnet with reboot policy FORCED [21:51:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1017.mgmt.eqiad.wmnet with reboot policy FORCED [21:52:12] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) [21:52:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1022.mgmt.eqiad.wmnet with reboot policy FORCED [21:52:31] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1024.mgmt.eqiad.wmnet with reboot policy FORCED [21:52:42] (03PS24) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [21:53:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [21:53:41] (03CR) 10Bking: [C: 03+1] elastic: don't alert p95 if request volume low [puppet] - 10https://gerrit.wikimedia.org/r/960712 (https://phabricator.wikimedia.org/T347341) (owner: 10Ryan Kemper) [21:54:00] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10bking) [21:54:38] (03PS25) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [21:56:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [21:57:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:57:49] (03CR) 10Ryan Kemper: [C: 03+2] elastic: don't alert p95 if request volume low [puppet] - 10https://gerrit.wikimedia.org/r/960712 (https://phabricator.wikimedia.org/T347341) (owner: 10Ryan Kemper) [21:58:04] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1021.eqiad.wmnet'] [21:58:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022.eqiad.wmnet'] [21:58:37] (03PS1) 10Andrew Bogott: Update fake password keys for mysql::dump [labs/private] - 10https://gerrit.wikimedia.org/r/960713 [21:58:49] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1024.eqiad.wmnet'] [21:58:56] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022.eqiad.wmnet'] [21:59:00] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023.eqiad.wmnet'] [21:59:27] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018.eqiad.wmnet'] [21:59:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [21:59:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022.eqiad.wmnet'] [21:59:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022.eqiad.wmnet'] [22:00:33] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Update fake password keys for mysql::dump [labs/private] - 10https://gerrit.wikimedia.org/r/960713 (owner: 10Andrew Bogott) [22:01:53] !log dancy@deploy2002 Finished scap: final test sync (duration: 15m 00s) [22:02:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST blockaffinities) on 
k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:03:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1019.eqiad.wmnet'] [22:04:27] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022'] [22:04:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022'] [22:05:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022'] [22:05:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022'] [22:07:01] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022'] [22:07:10] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1022'] [22:07:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1024.eqiad.wmnet'] [22:07:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1021.eqiad.wmnet'] [22:07:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1018.eqiad.wmnet'] [22:07:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1017.eqiad.wmnet'] [22:08:53] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [22:08:59] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host 
wdqs1017.eqiad.wmnet with OS bullseye [22:09:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023.eqiad.wmnet'] [22:11:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1019.eqiad.wmnet'] [22:11:42] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye [22:11:43] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1019.eqiad.wmnet with OS bullseye [22:11:49] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye [22:11:52] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye [22:13:34] cjming Sorry my internet completely died will re-schedule it for tomorrow :/ [22:13:57] Many thanks for your availability btw :D [22:13:58] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye [22:14:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye [22:14:04] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1022.eqiad.wmnet with OS bullseye [22:14:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for 
host wdqs1022.eqiad.wmnet with OS bullseye [22:14:12] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1023.eqiad.wmnet with OS bullseye [22:14:18] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1024.eqiad.wmnet with OS bullseye [22:14:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye [22:14:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye [22:15:16] (03PS1) 10Ryan Kemper: Revert "elastic: don't alert p95 if request volume low" [puppet] - 10https://gerrit.wikimedia.org/r/960728 [22:16:31] (03PS2) 10Ryan Kemper: Revert "elastic: don't alert p95 if request volume low" [puppet] - 10https://gerrit.wikimedia.org/r/960728 (https://phabricator.wikimedia.org/T347341) [22:16:48] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "elastic: don't alert p95 if request volume low" [puppet] - 10https://gerrit.wikimedia.org/r/960728 (https://phabricator.wikimedia.org/T347341) (owner: 10Ryan Kemper) [22:20:50] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:21:11] (03PS26) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) 
[22:22:18] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:23:15] (03PS1) 10Ryan Kemper: elastic: don't alert p95 if request volume low [puppet] - 10https://gerrit.wikimedia.org/r/960717 (https://phabricator.wikimedia.org/T347341) [22:23:44] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:24:38] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: don't alert p95 if request volume low [puppet] - 10https://gerrit.wikimedia.org/r/960717 (https://phabricator.wikimedia.org/T347341) (owner: 10Ryan Kemper) [22:25:42] (03PS27) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [22:35:17] (03PS28) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [22:35:44] (03CR) 10CI reject: [V: 04-1] designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: 10Andrew Bogott) [22:37:51] (03PS29) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [22:39:02] (03PS1) 10Ryan Kemper: elastic: standardize eqiad & codfw p95 metrics [puppet] - 10https://gerrit.wikimedia.org/r/960721 (https://phabricator.wikimedia.org/T347341) [22:39:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:40:13] (03CR) 10Ebernhardson: [C: 03+1] elastic: standardize eqiad & codfw p95 metrics [puppet] - 10https://gerrit.wikimedia.org/r/960721 (https://phabricator.wikimedia.org/T347341) (owner: 10Ryan Kemper) [22:40:25] (03CR) 10Ryan Kemper: [C: 03+2] elastic: standardize eqiad & codfw p95 metrics [puppet] - 10https://gerrit.wikimedia.org/r/960721 (https://phabricator.wikimedia.org/T347341) (owner: 10Ryan Kemper) [22:40:50] (03PS30) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [22:40:57] (03PS1) 10Andrea Denisse: prometheus: Enable selective scraping for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656) [22:46:13] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10RKemper) [22:48:13] (03CR) 10Andrea Denisse: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [22:49:43] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/958807/43582/" [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [22:51:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:51:30] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:52:00] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:52:56] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:53:26] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:57:15] (03CR) 10Andrea Denisse: "Patch #958807 must be merged and applied on all hosts before merging and applying." 
[puppet] - 10https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [23:11:08] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:12:16] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:15:48] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:18:14] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:29:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye [23:29:15] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: 
Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove... [23:31:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1018.eqiad.wmnet with OS bullseye [23:31:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1019.eqiad.wmnet with OS bullseye [23:32:02] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye executed with errors: - wdqs1018 (**FAIL**) - Remove... [23:32:06] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye executed with errors: - wdqs1019 (**FAIL**) - Remove...
[23:33:23] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017'] [23:33:45] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1017'] [23:33:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017'] [23:34:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1021.eqiad.wmnet with OS bullseye [23:34:18] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**... [23:34:20] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1022.eqiad.wmnet with OS bullseye [23:34:25] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye executed with errors: - wdqs1022 (**FAIL**... [23:34:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1023.eqiad.wmnet with OS bullseye [23:34:30] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors: - wdqs1023 (**FAIL**...
[23:34:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1024.eqiad.wmnet with OS bullseye [23:34:37] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye executed with errors: - wdqs1024 (**FAIL**... [23:34:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018'] [23:34:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1019'] [23:35:00] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1021'] [23:35:08] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:35:27] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:35:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1024'] [23:35:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1018'] [23:35:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1019'] [23:35:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:35:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:36:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:36:18] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for
enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:36:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:36:40] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:36:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1018'] [23:37:15] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1019'] [23:37:36] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:37:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:37:42] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:37:43] (PS31) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [23:37:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:38:08] (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott) [23:40:04] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:40:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:40:13] !log jclark@cumin1001 START - Cookbook
sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:40:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:40:32] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:40:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1017'] [23:41:26] (PS32) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [23:41:50] (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott) [23:42:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1021'] [23:42:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:42:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:42:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:42:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1024'] [23:43:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:43:11] !log
jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022'] [23:43:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:43:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:43:57] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [23:44:03] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye [23:44:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1018'] [23:44:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1019'] [23:44:46] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye [23:44:48] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:44:50] (PS33) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) [23:44:52] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1019.eqiad.wmnet with OS bullseye [23:44:55] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye [23:44:59] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] -
https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye [23:45:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1024.eqiad.wmnet with OS bullseye [23:45:11] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye [23:45:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye [23:45:20] (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott) [23:45:24] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye [23:45:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:45:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:48:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1022'] [23:49:35] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1022.eqiad.wmnet with OS bullseye [23:49:41] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye [23:50:44] !log
jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [23:50:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [23:51:31] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED [23:55:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1023.mgmt.eqiad.wmnet with reboot policy FORCED [23:56:36] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:57:56] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase