[00:00:04] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5016 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:08] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[00:00:14] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:18] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:30] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:30] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:34] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3051 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:34] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3056 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:36] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3057 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:38] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:38] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:40] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:42] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:42] <wikibugs>	 (CR) Cwhite: opensearch: make upgrade-phatality.sh stricter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar)
[00:00:44] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6016 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:58] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:02] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3064 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3054 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3055 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6001 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:12] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3050 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:14] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:16] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3058 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:20] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3060 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:20] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3063 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:22] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:28] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:30] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:34] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:34] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:36] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3061 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:38] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6008 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:44] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:50] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:50] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:54] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:56] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:58] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5008 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:58] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3062 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:59] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:02] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3065 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3052 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3059 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3053 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:04:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[00:04:42] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:05:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[00:06:52] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 on db1154 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:07:06] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:09:23] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[00:14:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[00:18:51] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[00:33:19] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:49] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS buster
[00:40:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:54:23] <wikibugs>	 (PS1) Ssingh: cp4043: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/849715 (https://phabricator.wikimedia.org/T317244)
[00:57:46] <wikibugs>	 (CR) Ssingh: [C: +2] cp4043: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/849715 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[00:59:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS buster
[00:59:23] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster
[01:03:18] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2049 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:05] <wikibugs>	 (PS3) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[01:10:54] <wikibugs>	 (CR) BCornwall: varnish: Conditionally set WMF-Last-Access cookie (3 comments) [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[01:11:38] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4043.ulsfo.wmnet with OS buster
[01:11:45] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster executed with errors: - cp4043 (**FA...
[01:15:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS buster
[01:16:06] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster
[01:23:48] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:25:34] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:25:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:28:34] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:36] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:32] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:33:54] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:35:05] <wikibugs>	 (PS1) Tim Starling: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/849670 (https://phabricator.wikimedia.org/T292552)
[01:35:57] <wikibugs>	 (PS1) Tim Starling: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/849671 (https://phabricator.wikimedia.org/T292552)
[01:36:19] <wikibugs>	 (CR) Tim Starling: [C: +2] In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/849670 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:36:24] <wikibugs>	 (CR) Tim Starling: [C: +2] In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/849671 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:36:52] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:37:32] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:38:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage
[01:45:24] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:12] <wikibugs>	 (Merged) jenkins-bot: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/849670 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:53:12] <wikibugs>	 (Merged) jenkins-bot: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/849671 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:56:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:57:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:57:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:58:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:00:05] <wikibugs>	 (PS1) Tim Starling: Temporary identity mappings for title case ligatures [mediawiki-config] - https://gerrit.wikimedia.org/r/849724 (https://phabricator.wikimedia.org/T292552)
[02:00:49] <wikibugs>	 (PS25) Andrew Bogott: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[02:02:29] <icinga-wm>	 PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:03:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:03:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:04:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:51] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.40.0-wmf.6/includes/language/Language.php: T292552 (duration: 03m 40s)
[02:06:56] <stashbot>	 T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552
[02:10:30] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.40.0-wmf.7/includes/language/Language.php: T292552 (duration: 03m 39s)
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:09] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4043.ulsfo.wmnet with OS buster
[02:13:17] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster completed: - cp4043 (**PASS**)   - R...
[02:15:07] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:15:15] <wikibugs>	 (CR) Andrew Bogott: [C: -1] "One wrinkle that I think your code isn't considering: the PAWS and Tools will have different NFS servers. Each can run a copy of the same " [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[02:25:38] <wikibugs>	 (CR) Tim Starling: [C: +2] Temporary identity mappings for title case ligatures [mediawiki-config] - https://gerrit.wikimedia.org/r/849724 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:26:20] <wikibugs>	 (Merged) jenkins-bot: Temporary identity mappings for title case ligatures [mediawiki-config] - https://gerrit.wikimedia.org/r/849724 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:29:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:30:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:30:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:30:57] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/UcfirstOverrides.php: T292552 allow title case ligatures (duration: 03m 36s)
[02:31:03] <stashbot>	 T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552
[02:31:14] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:36:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:37:58] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 14.30 ms
[02:38:10] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:40:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:40:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:41:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:45:03] <wikibugs>	 (PS5) Tim Starling: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552)
[02:46:56] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:49:44] <wikibugs>	 (CR) Tim Starling: [C: +2] Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:50:27] <wikibugs>	 (Merged) jenkins-bot: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:53:14] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:56:11] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/UcfirstOverrides.php: T292552 final configuration (duration: 03m 54s)
[02:56:17] <stashbot>	 T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552
[02:56:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:57:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:57:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:58:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:59:10] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:32] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:02] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:38] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:47:52] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[04:04:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:14:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:31:01] <wikibugs>	 (PS9) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949
[04:31:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[04:34:49] <wikibugs>	 (Merged) jenkins-bot: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[04:46:44] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:52:44] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms
[04:53:38] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:42] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:08] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:07:14] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[05:09:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:20:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T321178
[05:20:56] <stashbot>	 T321178: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T321178
[05:21:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T321178
[05:21:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1130 with weight 0 T321178', diff saved to https://phabricator.wikimedia.org/P36636 and previous config saved to /var/cache/conftool/dbconfig/20221027-052127-ladsgroup.json
[05:22:52] <wikibugs>	 (CR) Sohom Datta: [C: +1] "Will schedule this for 6:30-7:30 (IST) today" [mediawiki-config] - https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: Bodhisattwa)
[05:24:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:24:22] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:26:34] <wikibugs>	 (PS4) DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333)
[05:28:25] <marostegui>	 !log dbmaint Switch x1 to SBR T318518
[05:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:31] <stashbot>	 T318518: Add `gemm_mentee_is_active` column to growthexperiments_mentor_mentee x1 table - https://phabricator.wikimedia.org/T318518
[05:30:30] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:52] <wikibugs>	 SRE, ops-eqiad, DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (Marostegui) Thanks a lot John
[05:35:08] <marostegui>	 !log Deploy schema change on x1 T318518
[05:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:14] <stashbot>	 T318518: Add `gemm_mentee_is_active` column to growthexperiments_mentor_mentee x1 table - https://phabricator.wikimedia.org/T318518
[05:39:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:44:06] <wikibugs>	 (PS1) Marostegui: Revert "clouddb1013, clouddb1017: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/849673
[05:44:57] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "clouddb1013, clouddb1017: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/849673 (owner: Marostegui)
[05:45:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 132 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:46:10] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:46:30] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1130 to s5 master [puppet] - https://gerrit.wikimedia.org/r/844017 (https://phabricator.wikimedia.org/T321178) (owner: Gerrit maintenance bot)
[05:46:35] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1130 to s5 master [puppet] - https://gerrit.wikimedia.org/r/844017 (https://phabricator.wikimedia.org/T321178) (owner: Gerrit maintenance bot)
[05:47:02] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:51:58] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:52:17] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Switch x1 to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/849674
[05:52:18] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms
[05:52:58] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "mariadb: Switch x1 to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/849674 (owner: Marostegui)
[05:55:19] <wikibugs>	 (CR) JMeybohm: [C: +1] coredns: add standard labels to resources (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) (owner: Elukey)
[05:57:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:57:04] <wikibugs>	 (CR) JMeybohm: [C: +2] kubernetes: Actually use the master_fqdn instead of the cert name [puppet] - https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T0600).
[06:00:22] <Amir1>	 let's go
[06:00:28] <marostegui>	 yep
[06:00:36] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:00:55] <Amir1>	 !log Starting s5 eqiad failover from db1100 to db1130 - T321178
[06:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:01] <stashbot>	 T321178: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T321178
[06:01:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T321178', diff saved to https://phabricator.wikimedia.org/P36637 and previous config saved to /var/cache/conftool/dbconfig/20221027-060102-ladsgroup.json
[06:01:30] <icinga-wm>	 PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[06:01:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1130 to s5 primary and set section read-write T321178', diff saved to https://phabricator.wikimedia.org/P36638 and previous config saved to /var/cache/conftool/dbconfig/20221027-060137-ladsgroup.json
[06:02:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:05:23] <wikibugs>	 (PS2) Ladsgroup: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/844018 (https://phabricator.wikimedia.org/T321178) (owner: Gerrit maintenance bot)
[06:05:40] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/844018 (https://phabricator.wikimedia.org/T321178) (owner: Gerrit maintenance bot)
[06:06:44] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 75%, RTA = 1.82 ms
[06:06:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1100 T321178', diff saved to https://phabricator.wikimedia.org/P36639 and previous config saved to /var/cache/conftool/dbconfig/20221027-060654-ladsgroup.json
[06:07:00] <stashbot>	 T321178: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T321178
[06:07:39] <wikibugs>	 (PS1) Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123)
[06:08:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:08:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:11:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:23:39] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:27:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:27:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[06:28:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[06:28:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[06:28:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[06:28:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T318950)', diff saved to https://phabricator.wikimedia.org/P36640 and previous config saved to /var/cache/conftool/dbconfig/20221027-062836-ladsgroup.json
[06:29:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:29:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:29:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[06:29:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[06:29:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:30:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:30:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T318950)', diff saved to https://phabricator.wikimedia.org/P36641 and previous config saved to /var/cache/conftool/dbconfig/20221027-063018-ladsgroup.json
[06:34:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T318950)', diff saved to https://phabricator.wikimedia.org/P36642 and previous config saved to /var/cache/conftool/dbconfig/20221027-063414-ladsgroup.json
[06:36:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318950)', diff saved to https://phabricator.wikimedia.org/P36643 and previous config saved to /var/cache/conftool/dbconfig/20221027-063631-ladsgroup.json
[06:36:38] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[06:45:02] <wikibugs>	 (PS26) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[06:45:43] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[06:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[06:48:46] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P36644 and previous config saved to /var/cache/conftool/dbconfig/20221027-064921-ladsgroup.json
[06:49:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1009.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[06:49:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1009.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[06:51:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P36645 and previous config saved to /var/cache/conftool/dbconfig/20221027-065138-ladsgroup.json
[06:52:42] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:52:55] <wikibugs>	 (CR) David Caro: Modify maintain-dbusers.py to call the rest-api service (2 comments) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[06:53:52] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1009.eqiad.wmnet with OS bullseye
[06:55:03] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS bullseye
[06:55:27] <wikibugs>	 (PS1) David Caro: openstack.add_flavor: create cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657)
[06:57:06] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:59:08] <wikibugs>	 (CR) CI reject: [V: -1] openstack.add_flavor: create cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: David Caro)
[07:00:04] <jouncebot>	 Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T0700).
[07:00:13] <apergos>	 morning!  there are two trainees signed up this morning but no patches in the window. I guess I'll give them the links for the docs and say a few words about how much simpler the deployment process is now with the new scap backport command, and then see if they want to reschedule, heh. 
[07:03:16] <Amir1>	 I'm joining too
[07:04:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P36646 and previous config saved to /var/cache/conftool/dbconfig/20221027-070427-ladsgroup.json
[07:05:16] <wikibugs>	 (PS2) David Caro: openstack.add_flavor: create cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657)
[07:06:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P36647 and previous config saved to /var/cache/conftool/dbconfig/20221027-070644-ladsgroup.json
[07:08:41] <wikibugs>	 (CR) CI reject: [V: -1] openstack.add_flavor: create cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: David Caro)
[07:09:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host reimage
[07:12:29] <wikibugs>	 (PS27) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[07:12:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host reimage
[07:12:54] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[07:14:47] <wikibugs>	 (PS3) David Caro: openstack.add_flavor: create cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657)
[07:15:12] <wikibugs>	 (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[07:17:35] <apergos>	 thanks everybody, see you all next time!
[07:19:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T318950)', diff saved to https://phabricator.wikimedia.org/P36648 and previous config saved to /var/cache/conftool/dbconfig/20221027-071934-ladsgroup.json
[07:19:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:19:40] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[07:19:47] <sergi0_>	 apergos: ty for your clarifications and directions!
[07:19:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:20:05] <apergos>	 sure thing sergi0_   thanks for showing up!
[07:21:11] <wikibugs>	 (PS28) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[07:21:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:21:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:21:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36649 and previous config saved to /var/cache/conftool/dbconfig/20221027-072148-ladsgroup.json
[07:21:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318950)', diff saved to https://phabricator.wikimedia.org/P36650 and previous config saved to /var/cache/conftool/dbconfig/20221027-072157-ladsgroup.json
[07:21:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[07:22:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[07:22:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T318950)', diff saved to https://phabricator.wikimedia.org/P36651 and previous config saved to /var/cache/conftool/dbconfig/20221027-072219-ladsgroup.json
[07:24:23] <wikibugs>	 (PS29) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[07:24:41] <wikibugs>	 (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[07:25:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance
[07:25:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance
[07:25:33] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (Joe) Hi, as stated in the email thread, I don't think this is a good course of action. `systemd::monitor` offers more functionality than the generic...
[07:25:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2023 (T321312)', diff saved to https://phabricator.wikimedia.org/P36652 and previous config saved to /var/cache/conftool/dbconfig/20221027-072536-ladsgroup.json
[07:25:41] <wikibugs>	 (PS2) Giuseppe Lavagetto: monitoring: introduce exclude list for checking systemd units [puppet] - https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253)
[07:25:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36653 and previous config saved to /var/cache/conftool/dbconfig/20221027-072543-ladsgroup.json
[07:25:45] <wikibugs>	 (PS2) Giuseppe Lavagetto: check_systemd_state: consume exclusion list [puppet] - https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253)
[07:25:49] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:49] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[07:26:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:27:32] <wikibugs>	 (CR) CI reject: [V: -1] monitoring: introduce exclude list for checking systemd units [puppet] - https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[07:27:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1009.eqiad.wmnet with OS bullseye
[07:27:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS bullseye completed: - ganeti1009 (**PASS**)   - Downtimed on...
[07:27:53] <wikibugs>	 (CR) CI reject: [V: -1] check_systemd_state: consume exclusion list [puppet] - https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[07:28:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:30:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318950)', diff saved to https://phabricator.wikimedia.org/P36654 and previous config saved to /var/cache/conftool/dbconfig/20221027-073014-ladsgroup.json
[07:32:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet
[07:36:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T321312)', diff saved to https://phabricator.wikimedia.org/P36655 and previous config saved to /var/cache/conftool/dbconfig/20221027-073612-ladsgroup.json
[07:38:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1005.eqiad.wmnet to drbd
[07:39:35] <dcausse>	 !log restarting blazegraph on wdqs1016 (BlazegraphFreeAllocatorsDecreasingRapidly)
[07:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet
[07:40:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P36656 and previous config saved to /var/cache/conftool/dbconfig/20221027-074050-ladsgroup.json
[07:43:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:44:45] <wikibugs>	 (PS3) Giuseppe Lavagetto: monitoring: introduce exclude list for checking systemd units [puppet] - https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253)
[07:44:47] <wikibugs>	 (PS3) Giuseppe Lavagetto: check_systemd_state: consume exclusion list [puppet] - https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253)
[07:45:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P36657 and previous config saved to /var/cache/conftool/dbconfig/20221027-074521-ladsgroup.json
[07:46:25] <wikibugs>	 (CR) CI reject: [V: -1] monitoring: introduce exclude list for checking systemd units [puppet] - https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[07:46:48] <wikibugs>	 (CR) CI reject: [V: -1] check_systemd_state: consume exclusion list [puppet] - https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[07:47:43] <wikibugs>	 (PS1) David Caro: puppet_enc: use the repo-wide line length and fix profile [puppet] - https://gerrit.wikimedia.org/r/850005
[07:48:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1005.eqiad.wmnet to drbd
[07:48:26] <icinga-wm>	 PROBLEM - Host kubestagetcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:48:55] <moritzm>	 ^ can be ignored, monitoring glitch
[07:49:06] <icinga-wm>	 RECOVERY - Host kubestagetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[07:49:58] <wikibugs>	 (CR) CI reject: [V: -1] puppet_enc: use the repo-wide line length and fix profile [puppet] - https://gerrit.wikimedia.org/r/850005 (owner: David Caro)
[07:50:59] <wikibugs>	 (PS23) Slyngshede: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[07:51:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P36658 and previous config saved to /var/cache/conftool/dbconfig/20221027-075118-ladsgroup.json
[07:51:34] <wikibugs>	 (CR) CI reject: [V: -1] role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede)
[07:53:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:53:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:53:26] <wikibugs>	 (PS2) David Caro: puppet_enc: use the repo-wide line length and fix profile [puppet] - https://gerrit.wikimedia.org/r/850005
[07:53:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P36659 and previous config saved to /var/cache/conftool/dbconfig/20221027-075327-ladsgroup.json
[07:53:33] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:53:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1005.eqiad.wmnet to plain
[07:53:45] <wikibugs>	 (PS24) Slyngshede: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[07:54:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance
[07:54:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance
[07:54:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1024 (T321312)', diff saved to https://phabricator.wikimedia.org/P36660 and previous config saved to /var/cache/conftool/dbconfig/20221027-075433-ladsgroup.json
[07:54:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:55:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1005.eqiad.wmnet to plain
[07:55:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P36661 and previous config saved to /var/cache/conftool/dbconfig/20221027-075556-ladsgroup.json
[07:59:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:00:04] <jouncebot>	 jnuche and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T0800).
[08:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[08:00:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P36662 and previous config saved to /var/cache/conftool/dbconfig/20221027-080027-ladsgroup.json
[08:02:46] <wikibugs>	 (CR) Slyngshede: role::idm Basic deployment of IDM (4 comments) [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede)
[08:02:57] <wikibugs>	 (CR) Slyngshede: role::idm Basic deployment of IDM (1 comment) [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede)
[08:03:45] <wikibugs>	 (CR) Elukey: "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) (owner: Elukey)
[08:05:49] <wikibugs>	 (PS1) Marostegui: mariadb: Promote pc1014 to pc3 master [puppet] - https://gerrit.wikimedia.org/r/850027
[08:06:22] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/850028 (https://phabricator.wikimedia.org/T320512)
[08:06:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/850028 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot)
[08:06:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P36663 and previous config saved to /var/cache/conftool/dbconfig/20221027-080625-ladsgroup.json
[08:06:51] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/850029
[08:07:13] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.40.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/850028 (https://phabricator.wikimedia.org/T320512) (owner: TrainBranchBot)
[08:08:25] <wikibugs>	 (CR) Ladsgroup: [C: +1] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/850029 (owner: Marostegui)
[08:08:48] <marostegui>	 jnuche: Could you let me know once you've finished with the train?
[08:09:16] <jnuche>	 marostegui: sure, will do
[08:09:22] <marostegui>	 thank you
[08:09:51] <wikibugs>	 (CR) Muehlenhoff: "Looks good, one final nit." [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede)
[08:11:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36664 and previous config saved to /var/cache/conftool/dbconfig/20221027-081103-ladsgroup.json
[08:11:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:11:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:11:10] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[08:11:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P36665 and previous config saved to /var/cache/conftool/dbconfig/20221027-081113-ladsgroup.json
[08:11:24] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.7  refs T320512
[08:11:29] <stashbot>	 T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512
[08:11:45] <wikibugs>	 (CR) Ladsgroup: [C: +1] "This really should be automated :sob:" [puppet] - https://gerrit.wikimedia.org/r/850027 (owner: Marostegui)
[08:13:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:13:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T321312)', diff saved to https://phabricator.wikimedia.org/P36666 and previous config saved to /var/cache/conftool/dbconfig/20221027-081339-ladsgroup.json
[08:14:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:14:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:15:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:15:16] <jnuche>	 marostegui: deployment is complete
[08:15:23] <marostegui>	 thank you!
[08:15:27] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote pc1014 to pc3 master [puppet] - https://gerrit.wikimedia.org/r/850027 (owner: Marostegui)
[08:15:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318950)', diff saved to https://phabricator.wikimedia.org/P36667 and previous config saved to /var/cache/conftool/dbconfig/20221027-081534-ladsgroup.json
[08:15:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[08:15:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[08:16:17] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/850029 (owner: Marostegui)
[08:16:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1009.eqiad.wmnet to cluster eqiad and group C
[08:16:58] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/850029 (owner: Marostegui)
[08:17:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/850029 (owner: Marostegui)
[08:17:30] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:850029|ProductionServices.php: Promote pc1014 to pc3 master]]
[08:17:31] <marostegui>	 Amir1: ^ \o/
[08:17:36] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1009.eqiad.wmnet to cluster eqiad and group C
[08:17:45] <Amir1>	 wohooo
[08:17:49] <logmsgbot>	 !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850029|ProductionServices.php: Promote pc1014 to pc3 master]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[08:17:51] <Amir1>	 this thing is awesome
[08:18:10] <wikibugs>	 (CR) Ladsgroup: [C: -1] add_cuc_user_ip_time_index_T321123.py: New schema change (5 comments) [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: Marostegui)
[08:18:40] <Amir1>	 oh schema changes show up here too. nice
[08:19:15] <jbond>	 !log upload vim python3-stdlib-extensions to buster component/python39
[08:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:14] <elukey>	 !log powercycle elastic2043 - no mgmt console tty available, not responsive to ssh, memory/dimm errors in `racadm getsel`
[08:20:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:20:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] puppet_enc: use the repo-wide line length and fix profile [puppet] - https://gerrit.wikimedia.org/r/850005 (owner: David Caro)
[08:20:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:21] <wikibugs>	 (CR) Jaime Nuche: [C: +1] opensearch: make upgrade-phatality.sh stricter [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar)
[08:20:59] <wikibugs>	 (PS2) Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123)
[08:21:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:21:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:21:22] <wikibugs>	 (CR) Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change (5 comments) [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: Marostegui)
[08:21:30] <wikibugs>	 ops-codfw, Discovery-Search: elastic2043 reported memory errors - https://phabricator.wikimedia.org/T321771 (elukey)
[08:21:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T321312)', diff saved to https://phabricator.wikimedia.org/P36668 and previous config saved to /var/cache/conftool/dbconfig/20221027-082131-ladsgroup.json
[08:21:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance
[08:21:47] <wikibugs>	 (CR) Clément Goubert: monitoring: introduce exclude list for checking systemd units (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[08:21:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance
[08:21:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:21:52] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850029|ProductionServices.php: Promote pc1014 to pc3 master]] (duration: 04m 22s)
[08:21:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36669 and previous config saved to /var/cache/conftool/dbconfig/20221027-082157-ladsgroup.json
[08:21:58] <wikibugs>	 (CR) CI reject: [V: -1] add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: Marostegui)
[08:22:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:22:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T318950)', diff saved to https://phabricator.wikimedia.org/P36670 and previous config saved to /var/cache/conftool/dbconfig/20221027-082211-ladsgroup.json
[08:22:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:22:17] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[08:22:20] <icinga-wm>	 RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.73 ms
[08:23:35] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] openstack.add_flavor: create cookbook (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: David Caro)
[08:23:43] <wikibugs>	 (PS3) Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123)
[08:24:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:09] <wikibugs>	 (CR) Elukey: [C: +2] coredns: add standard labels to resources [deployment-charts] - https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) (owner: Elukey)
[08:26:16] <icinga-wm>	 RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:00] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[08:27:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P36671 and previous config saved to /var/cache/conftool/dbconfig/20221027-082707-ladsgroup.json
[08:27:14] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[08:28:38] <wikibugs>	 (PS25) Slyngshede: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[08:28:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P36672 and previous config saved to /var/cache/conftool/dbconfig/20221027-082846-ladsgroup.json
[08:29:14] <wikibugs>	 (CR) Ladsgroup: [C: +1] add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: Marostegui)
[08:30:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318950)', diff saved to https://phabricator.wikimedia.org/P36673 and previous config saved to /var/cache/conftool/dbconfig/20221027-083017-ladsgroup.json
[08:30:24] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[08:30:27] <wikibugs>	 (CR) David Caro: reprepro: add kubeadm-k8s-1-21/22 bullseye suite (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) (owner: David Caro)
[08:31:07] <wikibugs>	 (CR) Slyngshede: role::idm Basic deployment of IDM (1 comment) [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede)
[08:32:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:32:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:34:28] <wikibugs>	 SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Clement_Goubert) Settling on `mw-web` as there's been no contrary opinion in a week.
[08:35:52] <wikibugs>	 (CR) David Caro: [C: +2] openstack.add_flavor: create cookbook (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: David Caro)
[08:36:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:36:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:36:24] <wikibugs>	 (CR) Marostegui: [C: +2] add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: Marostegui)
[08:37:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:37:06] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:37:49] <wikibugs>	 (CR) Kosta Harlan: [C: +2] [labs] GrowthExperiments: Beta wikis to use NewImpact module by default [mediawiki-config] - https://gerrit.wikimedia.org/r/844046 (https://phabricator.wikimedia.org/T311299) (owner: Kosta Harlan)
[08:38:33] <wikibugs>	 (Merged) jenkins-bot: [labs] GrowthExperiments: Beta wikis to use NewImpact module by default [mediawiki-config] - https://gerrit.wikimedia.org/r/844046 (https://phabricator.wikimedia.org/T311299) (owner: Kosta Harlan)
[08:38:39] <wikibugs>	 (CR) David Caro: [C: +2] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: David Caro)
[08:38:56] <wikibugs>	 (PS8) David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[08:38:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:39:04] <wikibugs>	 (Merged) jenkins-bot: openstack.add_flavor: create cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: David Caro)
[08:39:06] <wikibugs>	 (Merged) jenkins-bot: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: Marostegui)
[08:40:48] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:41:26] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36674 and previous config saved to /var/cache/conftool/dbconfig/20221027-084214-ladsgroup.json
[08:42:28] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:43:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:43:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P36675 and previous config saved to /var/cache/conftool/dbconfig/20221027-084352-ladsgroup.json
[08:43:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:44:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:45:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P36676 and previous config saved to /var/cache/conftool/dbconfig/20221027-084523-ladsgroup.json
[08:46:53] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/850064
[08:47:09] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Promote pc1014 to pc3 master" [puppet] - https://gerrit.wikimedia.org/r/850065
[08:47:45] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/850064 (owner: Marostegui)
[08:47:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Set profile::contacts::role_contacts for gitlab runners [puppet] - https://gerrit.wikimedia.org/r/849063 (owner: Muehlenhoff)
[08:48:14] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "mariadb: Promote pc1014 to pc3 master" [puppet] - https://gerrit.wikimedia.org/r/850065 (owner: Marostegui)
[08:48:40] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/850064 (owner: Marostegui)
[08:48:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/850064 (owner: Marostegui)
[08:49:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (mfossati)
[08:50:37] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:850064|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]]
[08:50:56] <logmsgbot>	 !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850064|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[08:51:51] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede)
[08:52:41] <wikibugs>	 (PS1) Marostegui: pc1014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/850037
[08:53:38] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (fgiunchedi) An update based on the feedback received by SREs: individual alerts for each `::job` are considered useful because the alerts can be dow...
[08:54:06] <wikibugs>	 (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[08:54:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (mfossati) Noting that I completed a deployment training session, see {T302204}. It will be useful for the next one to have deployment access, see {T313812}.  @thcipriani : not sure a...
[08:54:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:54:38] <icinga-wm>	 PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:55:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance
[08:55:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:55:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:55:30] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850064|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] (duration: 04m 52s)
[08:55:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance
[08:56:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[08:56:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[08:56:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T321123)', diff saved to https://phabricator.wikimedia.org/P36678 and previous config saved to /var/cache/conftool/dbconfig/20221027-085617-marostegui.json
[08:56:18] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:56:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:56:23] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[08:56:56] <wikibugs>	 (PS3) WMDE-Fisch: Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692)
[08:57:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36679 and previous config saved to /var/cache/conftool/dbconfig/20221027-085720-ladsgroup.json
[08:57:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[08:57:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[08:57:55] <wikibugs>	 (PS4) Giuseppe Lavagetto: monitoring: introduce exclude list for checking systemd units [puppet] - https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253)
[08:57:57] <wikibugs>	 (PS4) Giuseppe Lavagetto: check_systemd_state: consume exclusion list [puppet] - https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253)
[08:58:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321123)', diff saved to https://phabricator.wikimedia.org/P36680 and previous config saved to /var/cache/conftool/dbconfig/20221027-085829-marostegui.json
[08:58:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T321312)', diff saved to https://phabricator.wikimedia.org/P36681 and previous config saved to /var/cache/conftool/dbconfig/20221027-085859-ladsgroup.json
[08:59:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance
[08:59:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance
[08:59:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36682 and previous config saved to /var/cache/conftool/dbconfig/20221027-085934-ladsgroup.json
[08:59:38] <wikibugs>	 (PS1) Jbond: tox.ini: drop support for python3.7/3.8 [cookbooks] - https://gerrit.wikimedia.org/r/850038
[09:00:00] <wikibugs>	 (CR) Marostegui: [C: +2] pc1014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/850037 (owner: Marostegui)
[09:00:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P36683 and previous config saved to /var/cache/conftool/dbconfig/20221027-090030-ladsgroup.json
[09:00:53] <wikibugs>	 (CR) CI reject: [V: -1] tox.ini: drop support for python3.7/3.8 [cookbooks] - https://gerrit.wikimedia.org/r/850038 (owner: Jbond)
[09:01:21] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms
[09:08:53] <wikibugs>	 (Abandoned) Filippo Giunchedi: dns: generate HOST.mgmt records in all statuses [software/netbox-extras] - https://gerrit.wikimedia.org/r/849495 (https://phabricator.wikimedia.org/T320721) (owner: Filippo Giunchedi)
[09:09:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[09:09:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[09:10:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[09:10:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[09:10:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[09:10:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[09:11:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P36684 and previous config saved to /var/cache/conftool/dbconfig/20221027-091130-ladsgroup.json
[09:11:36] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[09:12:09] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/850040
[09:12:15] <wikibugs>	 (CR) Vgutierrez: [C: -2] "Let's get rid of the trafficserver9 component (after copying the packages to main of course)" [puppet] - https://gerrit.wikimedia.org/r/849640 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[09:12:25] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:12:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P36685 and previous config saved to /var/cache/conftool/dbconfig/20221027-091227-ladsgroup.json
[09:12:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[09:12:33] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[09:12:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[09:12:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P36686 and previous config saved to /var/cache/conftool/dbconfig/20221027-091249-ladsgroup.json
[09:13:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[09:13:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[09:13:22] <wikibugs>	 (PS1) Marostegui: mariadb: Promote pc2014 to pc1 codfw master [puppet] - https://gerrit.wikimedia.org/r/850041
[09:13:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36687 and previous config saved to /var/cache/conftool/dbconfig/20221027-091336-marostegui.json
[09:14:28] <wikibugs>	 SRE, Traffic: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 (Vgutierrez)
[09:14:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] confd: don't backup tidied files [puppet] - https://gerrit.wikimedia.org/r/849600 (https://phabricator.wikimedia.org/T321678) (owner: Filippo Giunchedi)
[09:14:52] <wikibugs>	 SRE, Traffic: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 (Vgutierrez) p:Triage→Medium
[09:15:12] <wikibugs>	 (PS2) Marostegui: ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/850040
[09:15:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318950)', diff saved to https://phabricator.wikimedia.org/P36688 and previous config saved to /var/cache/conftool/dbconfig/20221027-091536-ladsgroup.json
[09:15:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[09:15:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[09:15:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[09:15:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[09:16:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T318950)', diff saved to https://phabricator.wikimedia.org/P36689 and previous config saved to /var/cache/conftool/dbconfig/20221027-091603-ladsgroup.json
[09:17:39] <moritzm>	 !log failover ganeti master in ulsfo to ganeti4008, unblocking future decom of ganeti4003 T317247
[09:17:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:45] <stashbot>	 T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247
[09:17:52] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] global: replace labsproject by wmcs_project (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro)
[09:19:18] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Clean up stale/old confd errors automatically - https://phabricator.wikimedia.org/T321678 (fgiunchedi) Open→Resolved This is done!
[09:19:25] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[09:20:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36691 and previous config saved to /var/cache/conftool/dbconfig/20221027-092028-ladsgroup.json
[09:21:13] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:23:13] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:23:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318950)', diff saved to https://phabricator.wikimedia.org/P36692 and previous config saved to /var/cache/conftool/dbconfig/20221027-092355-ladsgroup.json
[09:24:02] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[09:24:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb-test2001.codfw.wmnet
[09:26:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P36693 and previous config saved to /var/cache/conftool/dbconfig/20221027-092636-ladsgroup.json
[09:26:45] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36694 and previous config saved to /var/cache/conftool/dbconfig/20221027-092842-marostegui.json
[09:29:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2334 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36696 and previous config saved to /var/cache/conftool/dbconfig/20221027-093250-ladsgroup.json
[09:34:21] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb-test2001.codfw.wmnet
[09:35:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2117', diff saved to https://phabricator.wikimedia.org/P36697 and previous config saved to /var/cache/conftool/dbconfig/20221027-093519-marostegui.json
[09:35:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P36698 and previous config saved to /var/cache/conftool/dbconfig/20221027-093534-ladsgroup.json
[09:37:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance
[09:37:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance
[09:37:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance
[09:37:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance
[09:38:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T321123)', diff saved to https://phabricator.wikimedia.org/P36699 and previous config saved to /var/cache/conftool/dbconfig/20221027-093804-marostegui.json
[09:38:10] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[09:38:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw
[09:39:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] kubernetes: Rename mwdebug to mw-debug (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[09:39:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P36700 and previous config saved to /var/cache/conftool/dbconfig/20221027-093902-ladsgroup.json
[09:39:25] <wikibugs>	 (CR) David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (9 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[09:40:14] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote pc2014 to pc1 codfw master [puppet] - https://gerrit.wikimedia.org/r/850041 (owner: Marostegui)
[09:40:19] <wikibugs>	 (CR) Ladsgroup: [C: +1] ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/850040 (owner: Marostegui)
[09:40:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321123)', diff saved to https://phabricator.wikimedia.org/P36701 and previous config saved to /var/cache/conftool/dbconfig/20221027-094030-marostegui.json
[09:40:35] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/850040 (owner: Marostegui)
[09:41:25] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/850040 (owner: Marostegui)
[09:41:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/850040 (owner: Marostegui)
[09:41:42] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:850040|ProductionServices.php: Promote pc2014 to pc1 codfw master]]
[09:41:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P36702 and previous config saved to /var/cache/conftool/dbconfig/20221027-094143-ladsgroup.json
[09:41:51] <wikibugs>	 (PS3) AikoChou: ml-services: add revert-risk-model isvc [deployment-charts] - https://gerrit.wikimedia.org/r/849627 (https://phabricator.wikimedia.org/T321594)
[09:42:01] <logmsgbot>	 !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850040|ProductionServices.php: Promote pc2014 to pc1 codfw master]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[09:46:08] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850040|ProductionServices.php: Promote pc2014 to pc1 codfw master]] (duration: 04m 26s)
[09:46:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw
[09:46:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P36703 and previous config saved to /var/cache/conftool/dbconfig/20221027-094655-ladsgroup.json
[09:47:01] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[09:47:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:47:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P36704 and previous config saved to /var/cache/conftool/dbconfig/20221027-094756-ladsgroup.json
[09:48:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:48:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:48:09] <wikibugs>	 (CR) Jbond: "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/850038 (owner: Jbond)
[09:49:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:50:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P36705 and previous config saved to /var/cache/conftool/dbconfig/20221027-095041-ladsgroup.json
[09:52:06] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add revert-risk-model isvc [deployment-charts] - https://gerrit.wikimedia.org/r/849627 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[09:52:11] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Promote pc2014 to pc1 codfw master" [puppet] - https://gerrit.wikimedia.org/r/850069
[09:52:23] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Clean up after ATS 9.x upgrade [puppet] - https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776)
[09:52:28] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master" [mediawiki-config] - https://gerrit.wikimedia.org/r/850070
[09:53:01] <wikibugs>	 (CR) CI reject: [V: -1] trafficserver: Clean up after ATS 9.x upgrade [puppet] - https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) (owner: Vgutierrez)
[09:53:50] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P36706 and previous config saved to /var/cache/conftool/dbconfig/20221027-095408-ladsgroup.json
[09:54:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad
[09:55:20] <icinga-wm>	 RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:55:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36707 and previous config saved to /var/cache/conftool/dbconfig/20221027-095537-marostegui.json
[09:55:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet
[09:56:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P36708 and previous config saved to /var/cache/conftool/dbconfig/20221027-095649-ladsgroup.json
[09:56:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:56:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:56:55] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[09:56:59] <wikibugs>	 (CR) Jbond: doc: add parameters for gitlab_runner and contint hosts, allow them (4 comments) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[09:57:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T318950)', diff saved to https://phabricator.wikimedia.org/P36709 and previous config saved to /var/cache/conftool/dbconfig/20221027-095700-ladsgroup.json
[09:57:28] <wikibugs>	 (CR) Jbond: "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/850038 (owner: Jbond)
[10:00:04] <jouncebot>	 mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1000).
[10:00:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet
[10:00:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T318950)', diff saved to https://phabricator.wikimedia.org/P36710 and previous config saved to /var/cache/conftool/dbconfig/20221027-100057-ladsgroup.json
[10:02:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36711 and previous config saved to /var/cache/conftool/dbconfig/20221027-100201-ladsgroup.json
[10:02:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad
[10:03:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P36712 and previous config saved to /var/cache/conftool/dbconfig/20221027-100303-ladsgroup.json
[10:03:28] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet
[10:03:39] <wikibugs_>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/842359 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi)
[10:05:30] <wikibugs_>	 (CR) Filippo Giunchedi: [C: +2] customscripts: exclude decommissioning hosts from mgmt data [software/netbox-extras] - https://gerrit.wikimedia.org/r/842359 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi)
[10:05:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36713 and previous config saved to /var/cache/conftool/dbconfig/20221027-100547-ladsgroup.json
[10:06:44] <wikibugs_>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[10:08:03] <wikibugs>	 (PS2) Clément Goubert: mediawiki: Create new mw-debug deployment [deployment-charts] - https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201)
[10:08:05] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "mariadb: Promote pc2014 to pc1 codfw master" [puppet] - https://gerrit.wikimedia.org/r/850069 (owner: Marostegui)
[10:08:16] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master" [mediawiki-config] - https://gerrit.wikimedia.org/r/850070 (owner: Marostegui)
[10:09:14] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master" [mediawiki-config] - https://gerrit.wikimedia.org/r/850070 (owner: Marostegui)
[10:09:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318950)', diff saved to https://phabricator.wikimedia.org/P36714 and previous config saved to /var/cache/conftool/dbconfig/20221027-100915-ladsgroup.json
[10:09:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[10:09:19] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001"
[10:09:21] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[10:09:23] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/850070 (owner: Marostegui)
[10:09:31] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:850070|Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master"]]
[10:09:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[10:09:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T318950)', diff saved to https://phabricator.wikimedia.org/P36715 and previous config saved to /var/cache/conftool/dbconfig/20221027-100948-ladsgroup.json
[10:09:50] <logmsgbot>	 !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850070|Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[10:10:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36716 and previous config saved to /var/cache/conftool/dbconfig/20221027-101043-marostegui.json
[10:11:32] <wikibugs>	 (PS3) Clément Goubert: kubernetes: Rename mwdebug to mw-debug [puppet] - https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201)
[10:12:31] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001"
[10:12:43] <wikibugs>	 (CR) Clément Goubert: kubernetes: Rename mwdebug to mw-debug (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[10:13:36] <volans>	 godog: \o/
[10:13:37] <wikibugs>	 (PS1) Marostegui: pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/850090
[10:13:53] <wikibugs>	 (CR) CI reject: [V: -1] kubernetes: Rename mwdebug to mw-debug [puppet] - https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[10:14:00] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850070|Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master"]] (duration: 04m 29s)
[10:14:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host stat1009.eqiad.wmnet
[10:14:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:15:01] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:15:02] <wikibugs>	 (PS4) Clément Goubert: kubernetes: Rename mwdebug to mw-debug [puppet] - https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201)
[10:15:12] <wikibugs>	 (CR) Marostegui: [C: +2] pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/850090 (owner: Marostegui)
[10:15:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:15:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:16:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P36717 and previous config saved to /var/cache/conftool/dbconfig/20221027-101604-ladsgroup.json
[10:16:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:17:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36718 and previous config saved to /var/cache/conftool/dbconfig/20221027-101708-ladsgroup.json
[10:17:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318950)', diff saved to https://phabricator.wikimedia.org/P36719 and previous config saved to /var/cache/conftool/dbconfig/20221027-101742-ladsgroup.json
[10:17:47] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[10:17:48] <godog>	 volans: \o/ indeed! hiera data updated
[10:18:15] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: Clément Goubert)
[10:18:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2030 as es1 master, es2031 as es2 master, es2029 as es3 master', diff saved to https://phabricator.wikimedia.org/P36720 and previous config saved to /var/cache/conftool/dbconfig/20221027-101842-marostegui.json
[10:18:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36721 and previous config saved to /var/cache/conftool/dbconfig/20221027-101848-ladsgroup.json
[10:18:53] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:51] <wikibugs>	 SRE, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF)
[10:20:33] <wikibugs>	 SRE, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF)
[10:21:04] <wikibugs>	 SRE, Infrastructure-Foundations: Setup an initial bookworm host with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (MoritzMuehlenhoff)
[10:21:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1009.eqiad.wmnet
[10:21:28] <wikibugs>	 (PS1) Muehlenhoff: debian: Add bookworm [puppet] - https://gerrit.wikimedia.org/r/850092 (https://phabricator.wikimedia.org/T321783)
[10:22:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2026 es2027 es2028 for upgrade', diff saved to https://phabricator.wikimedia.org/P36722 and previous config saved to /var/cache/conftool/dbconfig/20221027-102209-root.json
[10:23:55] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:25] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:15] <wikibugs>	 (PS1) Jbond: aptrepo: Add component pyall [puppet] - https://gerrit.wikimedia.org/r/850093
[10:25:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321123)', diff saved to https://phabricator.wikimedia.org/P36723 and previous config saved to /var/cache/conftool/dbconfig/20221027-102550-marostegui.json
[10:25:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance
[10:25:56] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[10:26:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance
[10:26:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T321123)', diff saved to https://phabricator.wikimedia.org/P36724 and previous config saved to /var/cache/conftool/dbconfig/20221027-102611-marostegui.json
[10:28:17] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:31] <wikibugs>	 (CR) Awight: [C: +1] Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[10:28:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321123)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102837-marostegui.json
[10:28:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 1%: After upgrade', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102843-root.json
[10:28:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 1%: After upgrade', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102847-root.json
[10:28:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 1%: After upgrade', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102852-root.json
[10:29:42] <stashbot>	 marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[10:29:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:30:12] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=
[10:30:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:19] <jinxer-wm>	 (ProbeDown) firing: (12) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:35] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[10:30:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[10:30:54] <Emperor>	 Hm.
[10:30:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:31:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P36725 and previous config saved to /var/cache/conftool/dbconfig/20221027-103110-ladsgroup.json
[10:31:12] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[10:31:12] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:31:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_443: Servers cp3054.esams.wmnet, cp3062.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:31:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1087.eqiad.wmnet, cp1075.eq
[10:31:16] <icinga-wm>	 t, cp1089.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:31:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled
[10:31:28] <icinga-wm>	 6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:31:32] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 1 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:31:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:31:32] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9677 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:31:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:31:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:31:34] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{
[10:31:34] <icinga-wm>	 Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:31:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:31:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1390.eqiad.wmnet, mw1447.eqiad.wmnet, mw1427.eqiad.wmnet, mw1361.eqiad.wmnet, mw1406.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1388.eqiad.wmnet, mw1358.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1402.eq
[10:31:34] <icinga-wm>	 t, mw1448.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1375.eqiad.wmnet, mw1314.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1408.eqiad.wmnet, mw1404.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet,
[10:31:35] <icinga-wm>	 eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1396.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet, mw1383.eqiad.wmnet, mw1400.eqiad.wmnet, mw1392.eqiad.wmnet, mw1443.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[10:31:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:31:42] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:31:42] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out be
[10:31:42] <icinga-wm>	 esponse was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[10:32:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:32:06] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[10:32:06] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/
[10:32:06] <icinga-wm>	 ps_%28service%29
[10:32:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P36726 and previous config saved to /var/cache/conftool/dbconfig/20221027-103214-ladsgroup.json
[10:32:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[10:32:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:32:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:20] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[10:32:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:32:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[10:32:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:32:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:32:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:32] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:32:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:32:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P36727 and previous config saved to /var/cache/conftool/dbconfig/20221027-103236-ladsgroup.json
[10:32:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:32:40] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:40] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[10:32:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P36728 and previous config saved to /var/cache/conftool/dbconfig/20221027-103248-ladsgroup.json
[10:32:57] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at codfw #page - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:33:00] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:33:10] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2416 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:33:10] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:34:42] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:34:52] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1059 is CRITICAL: CRITICAL - load average: 117.09, 101.62, 59.00 https://wikitech.wikimedia.org/wiki/Swift
[10:35:14] <TheresNoTime>	 ^ ouch? :/
[10:35:18] <jinxer-wm>	 (ProbeDown) resolved: (17) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:35:19] <jinxer-wm>	 (ProbeDown) resolved: (17) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:35:20] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[10:35:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[10:35:35] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[10:35:48] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[10:35:50] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:35:58] <logmsgbot>	 !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1059.eqiad.wmnet
[10:37:59] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at codfw #page - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:39:26] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200): /v2/suggest
[10:39:26] <icinga-wm>	 s/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:41:24] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:41:37] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (TheresNoTime) Just noting the things I've tried (unsuccessfully):  - Purging the varnish cache for this file  - Deleting some generated t...
[10:41:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet
[10:43:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36729 and previous config saved to /var/cache/conftool/dbconfig/20221027-104348-marostegui.json
[10:43:52] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert)
[10:43:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36730 and previous config saved to /var/cache/conftool/dbconfig/20221027-104352-root.json
[10:43:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36731 and previous config saved to /var/cache/conftool/dbconfig/20221027-104356-root.json
[10:44:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36732 and previous config saved to /var/cache/conftool/dbconfig/20221027-104402-root.json
[10:44:38] <icinga-wm>	 PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 276 MB (3% inode=62%): /tmp 276 MB (3% inode=62%): /var/tmp 276 MB (3% inode=62%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[10:44:40] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:44:55] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert) Open→In progress p:Triage→High
[10:45:07] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[10:45:27] <wikibugs>	 (PS1) Clément Goubert: hieradata: Add usernames for mw on k8s services [puppet] - https://gerrit.wikimedia.org/r/850094 (https://phabricator.wikimedia.org/T321786)
[10:46:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T318950)', diff saved to https://phabricator.wikimedia.org/P36733 and previous config saved to /var/cache/conftool/dbconfig/20221027-104617-ladsgroup.json
[10:46:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[10:46:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[10:46:22] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:46:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T318950)', diff saved to https://phabricator.wikimedia.org/P36734 and previous config saved to /var/cache/conftool/dbconfig/20221027-104627-ladsgroup.json
[10:46:50] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[10:47:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P36735 and previous config saved to /var/cache/conftool/dbconfig/20221027-104755-ladsgroup.json
[10:48:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet
[10:48:35] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but a few questions inline." [puppet] - https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: David Caro)
[10:49:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:50:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318950)', diff saved to https://phabricator.wikimedia.org/P36736 and previous config saved to /var/cache/conftool/dbconfig/20221027-105024-ladsgroup.json
[10:50:48] <wikibugs>	 (PS1) Clément Goubert: admin: add mw on kubernetes namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786)
[10:50:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] p::ceph:mon: set permissions if mgr key parent dirs [puppet] - https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: David Caro)
[10:52:07] <wikibugs>	 (CR) CI reject: [V: -1] admin: add mw on kubernetes namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: Clément Goubert)
[10:52:30] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (jcrespo) > we want to move "systemd unit failed" off Icinga and onto AM too  This is higher level, and out of scope of this ticket, but I wonder if...
[10:54:44] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:54:54] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:57:06] <wikibugs>	 (CR) Svantje Lilienthal: [C: +1] Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[10:57:26] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:57:58] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1059 is OK: OK - load average: 36.40, 61.19, 78.66 https://wikitech.wikimedia.org/wiki/Swift
[10:58:30] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:58:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36737 and previous config saved to /var/cache/conftool/dbconfig/20221027-105855-marostegui.json
[10:59:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36738 and previous config saved to /var/cache/conftool/dbconfig/20221027-105901-root.json
[10:59:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36739 and previous config saved to /var/cache/conftool/dbconfig/20221027-105907-root.json
[10:59:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36740 and previous config saved to /var/cache/conftool/dbconfig/20221027-105910-root.json
[11:02:28] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[11:03:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318950)', diff saved to https://phabricator.wikimedia.org/P36742 and previous config saved to /var/cache/conftool/dbconfig/20221027-110301-ladsgroup.json
[11:03:07] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:03:27] <wikibugs>	 (PS6) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[11:04:18] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:04:32] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:05:12] <moritzm>	 !log installing nodejs security updates on buster
[11:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P36743 and previous config saved to /var/cache/conftool/dbconfig/20221027-110531-ladsgroup.json
[11:06:16] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:06:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P36744 and previous config saved to /var/cache/conftool/dbconfig/20221027-110638-ladsgroup.json
[11:06:44] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:09:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance
[11:09:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance
[11:09:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36745 and previous config saved to /var/cache/conftool/dbconfig/20221027-110920-ladsgroup.json
[11:09:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maintenance
[11:10:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maintenance
[11:10:06] <wikibugs>	 (PS1) Kosta Harlan: [labs] GrowthExperiments: Use d3.js with new impact module [mediawiki-config] - https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854)
[11:10:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36746 and previous config saved to /var/cache/conftool/dbconfig/20221027-111009-ladsgroup.json
[11:10:22] <icinga-wm>	 RECOVERY - Host netflow1002 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[11:10:54] <icinga-wm>	 PROBLEM - Check systemd state on netflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:11:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[11:11:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[11:11:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance
[11:11:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance
[11:11:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[11:11:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[11:12:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T318950)', diff saved to https://phabricator.wikimedia.org/P36747 and previous config saved to /var/cache/conftool/dbconfig/20221027-111204-ladsgroup.json
[11:12:10] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:14:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321123)', diff saved to https://phabricator.wikimedia.org/P36748 and previous config saved to /var/cache/conftool/dbconfig/20221027-111401-marostegui.json
[11:14:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance
[11:14:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36749 and previous config saved to /var/cache/conftool/dbconfig/20221027-111406-root.json
[11:14:08] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[11:14:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36750 and previous config saved to /var/cache/conftool/dbconfig/20221027-111412-root.json
[11:14:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T318950)', diff saved to https://phabricator.wikimedia.org/P36751 and previous config saved to /var/cache/conftool/dbconfig/20221027-111414-ladsgroup.json
[11:14:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance
[11:14:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance
[11:14:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance
[11:14:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36752 and previous config saved to /var/cache/conftool/dbconfig/20221027-111422-root.json
[11:14:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T321123)', diff saved to https://phabricator.wikimedia.org/P36753 and previous config saved to /var/cache/conftool/dbconfig/20221027-111427-marostegui.json
[11:15:04] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:15:06] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:16:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321123)', diff saved to https://phabricator.wikimedia.org/P36754 and previous config saved to /var/cache/conftool/dbconfig/20221027-111653-marostegui.json
[11:20:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P36755 and previous config saved to /var/cache/conftool/dbconfig/20221027-112037-ladsgroup.json
[11:21:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36756 and previous config saved to /var/cache/conftool/dbconfig/20221027-112144-ladsgroup.json
[11:22:20] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704)
[11:22:28] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on netflow1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:23:16] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:43] <wikibugs>	 (PS1) Stang: Define a default value for wgPageTriageMaxAge [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974)
[11:23:49] <wikibugs>	 (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/850105 (owner: L10n-bot)
[11:24:02] <koi>	 TheresNoTime: ^
[11:24:09] <TheresNoTime>	 thanks!
[11:24:17] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:24:18] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:25:05] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704)
[11:26:25] <wikibugs>	 (CR) Novem Linguae: "Isn't this already covered by https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808424/ ? That patch has both a default value" [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[11:26:28] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704)
[11:26:35] <wikibugs>	 (PS1) Giuseppe Lavagetto: sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110
[11:26:53] <wikibugs>	 (CR) CI reject: [V: -1] sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto)
[11:27:35] <wikibugs>	 (CR) Volans: [C: +1] "LOL, sorry missed that" [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto)
[11:27:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36757 and previous config saved to /var/cache/conftool/dbconfig/20221027-112740-ladsgroup.json
[11:29:00] <wikibugs>	 (PS2) Giuseppe Lavagetto: sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110
[11:29:06] <icinga-wm>	 PROBLEM - Host netflow1002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:29:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto)
[11:29:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36758 and previous config saved to /var/cache/conftool/dbconfig/20221027-112911-root.json
[11:29:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36759 and previous config saved to /var/cache/conftool/dbconfig/20221027-112917-root.json
[11:29:20] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36760 and previous config saved to /var/cache/conftool/dbconfig/20221027-112927-root.json
[11:29:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P36761 and previous config saved to /var/cache/conftool/dbconfig/20221027-112927-ladsgroup.json
[11:31:09] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/850092 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[11:31:36] <wikibugs>	 (PS3) Majavah: admin: Add wmcs-roots to cloudgw nodes [puppet] - https://gerrit.wikimedia.org/r/745952
[11:32:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36762 and previous config saved to /var/cache/conftool/dbconfig/20221027-113159-marostegui.json
[11:32:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow1002.eqiad.wmnet to plain
[11:35:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow1002.eqiad.wmnet to plain
[11:35:03] <wikibugs>	 (Merged) jenkins-bot: sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto)
[11:35:17] <wikibugs>	 (CR) Samtar: [C: +1] "lgtm!" [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[11:35:26] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:35:27] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:35:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] admin: Add wmcs-roots to cloudgw nodes [puppet] - https://gerrit.wikimedia.org/r/745952 (owner: Majavah)
[11:35:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318950)', diff saved to https://phabricator.wikimedia.org/P36763 and previous config saved to /var/cache/conftool/dbconfig/20221027-113544-ladsgroup.json
[11:35:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:35:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:35:50] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:35:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T318950)', diff saved to https://phabricator.wikimedia.org/P36764 and previous config saved to /var/cache/conftool/dbconfig/20221027-113554-ladsgroup.json
[11:36:08] <wikibugs>	 (PS4) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[11:36:17] <wikibugs>	 (CR) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (5 comments) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[11:36:42] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[11:36:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36765 and previous config saved to /var/cache/conftool/dbconfig/20221027-113651-ladsgroup.json
[11:38:03] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:38:53] <wikibugs>	 (PS5) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[11:39:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs4005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:39:52] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs
[11:40:38] <wikibugs>	 (PS7) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[11:40:57] <wikibugs>	 (CR) CI reject: [V: -1] enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[11:41:06] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704)
[11:41:40] <wikibugs>	 (PS8) Stang: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974)
[11:42:37] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37802/console" [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[11:42:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021', diff saved to https://phabricator.wikimedia.org/P36767 and previous config saved to /var/cache/conftool/dbconfig/20221027-114246-ladsgroup.json
[11:43:16] <wikibugs>	 (PS7) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704)
[11:43:55] <volans>	 thx
[11:44:03] <wikibugs>	 SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (jcrespo) https://lists.wikimedia.org/ having issues again? I get a response, but it takes 47-48 seconds to return a 301.
[11:44:03] * volans wrong chan
[11:44:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36768 and previous config saved to /var/cache/conftool/dbconfig/20221027-114416-root.json
[11:44:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36769 and previous config saved to /var/cache/conftool/dbconfig/20221027-114422-root.json
[11:44:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36770 and previous config saved to /var/cache/conftool/dbconfig/20221027-114432-root.json
[11:45:39] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: refresh cloudgw server [dns] - https://gerrit.wikimedia.org/r/850116 (https://phabricator.wikimedia.org/T318704)
[11:47:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36771 and previous config saved to /var/cache/conftool/dbconfig/20221027-114706-marostegui.json
[11:47:07] <wikibugs>	 SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (Ladsgroup) It has been slow, I haven't looked why, it can be either of these two:  - Somehow db responses are slow (many junk users created? Many junk emails have...
[11:49:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF) p:Triage→Medium
[11:51:22] <wikibugs>	 SRE, Data-Engineering-Operations, Data-Engineering-Planning, Mail, Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (BTullis) OK, thanks all. I'll make that change to the exim aliases file: `analytics-alerts:...
[11:51:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P36772 and previous config saved to /var/cache/conftool/dbconfig/20221027-115157-ladsgroup.json
[11:51:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[11:52:04] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:52:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[11:52:14] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, phab1004, releases1002, releases2002, relforge1003, relforge1004, wcqs2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_change
[11:52:28] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:53:10] <icinga-wm>	 PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:54] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:05] <wikibugs>	 (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: WMDE-Fisch)
[11:56:53] <wikibugs>	 (PS1) Filippo Giunchedi: Revert "prometheus: temp disable mgmt checks until hiera export script is fixed" [puppet] - https://gerrit.wikimedia.org/r/850075
[11:57:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021', diff saved to https://phabricator.wikimedia.org/P36773 and previous config saved to /var/cache/conftool/dbconfig/20221027-115753-ladsgroup.json
[11:59:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36774 and previous config saved to /var/cache/conftool/dbconfig/20221027-115920-root.json
[11:59:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36775 and previous config saved to /var/cache/conftool/dbconfig/20221027-115927-root.json
[11:59:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36776 and previous config saved to /var/cache/conftool/dbconfig/20221027-115936-root.json
[11:59:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T318950)', diff saved to https://phabricator.wikimedia.org/P36777 and previous config saved to /var/cache/conftool/dbconfig/20221027-115939-ladsgroup.json
[11:59:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance
[11:59:46] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:59:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance
[11:59:56] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T318950)', diff saved to https://phabricator.wikimedia.org/P36778 and previous config saved to /var/cache/conftool/dbconfig/20221027-120001-ladsgroup.json
[12:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[12:00:29] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Revert "prometheus: temp disable mgmt checks until hiera export script is fixed" [puppet] - https://gerrit.wikimedia.org/r/850075 (owner: Filippo Giunchedi)
[12:00:34] <wikibugs>	 (PS2) Filippo Giunchedi: Revert "prometheus: temp disable mgmt checks until hiera export script is fixed" [puppet] - https://gerrit.wikimedia.org/r/850075
[12:00:57] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs200[7-8].codfw.wmnet} and A:lvs
[12:01:53] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs200[7-8].codfw.wmnet} and A:lvs
[12:02:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318950)', diff saved to https://phabricator.wikimedia.org/P36780 and previous config saved to /var/cache/conftool/dbconfig/20221027-120211-ladsgroup.json
[12:02:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance
[12:02:18] <wikibugs>	 SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (jcrespo) Sample traffic (under NDA):  {P36779}
[12:02:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance
[12:02:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36781 and previous config saved to /var/cache/conftool/dbconfig/20221027-120234-marostegui.json
[12:02:41] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[12:03:26] <wikibugs>	 (CR) David Caro: "Do you have to change https://gerrit.wikimedia.org/g/operations/puppet/+/1cb0f1e4cf777795474dae711f03ed167949c3d3/hieradata/codfw/profile/" [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez)
[12:03:28] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:04:12] <wikibugs>	 (CR) David Caro: cloudgw2003-dev: give proper role and take over cloudgw2001-dev (1 comment) [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez)
[12:04:58] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/850116 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez)
[12:05:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36782 and previous config saved to /var/cache/conftool/dbconfig/20221027-120500-marostegui.json
[12:07:43] <wikibugs>	 (PS1) Ladsgroup: lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703)
[12:07:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:08:54] <wikibugs>	 (PS8) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704)
[12:09:14] <wikibugs>	 (CR) Jcrespo: [C: +1] lists: Ban PetalBot from crawling (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) (owner: Ladsgroup)
[12:11:10] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev (1 comment) [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez)
[12:11:20] <wikibugs>	 (CR) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (4 comments) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[12:13:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36783 and previous config saved to /var/cache/conftool/dbconfig/20221027-121259-ladsgroup.json
[12:13:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance
[12:13:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance
[12:13:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36784 and previous config saved to /var/cache/conftool/dbconfig/20221027-121323-ladsgroup.json
[12:13:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[12:14:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36785 and previous config saved to /var/cache/conftool/dbconfig/20221027-121425-root.json
[12:14:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36786 and previous config saved to /var/cache/conftool/dbconfig/20221027-121432-root.json
[12:14:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36787 and previous config saved to /var/cache/conftool/dbconfig/20221027-121441-root.json
[12:15:09] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: refresh cloudgw server [dns] - https://gerrit.wikimedia.org/r/850116 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez)
[12:15:34] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:15:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36788 and previous config saved to /var/cache/conftool/dbconfig/20221027-121550-ladsgroup.json
[12:16:56] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P36789 and previous config saved to /var/cache/conftool/dbconfig/20221027-121717-ladsgroup.json
[12:18:01] <wikibugs>	 (PS2) Ladsgroup: lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703)
[12:18:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[12:19:36] <wikibugs>	 (PS6) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[12:20:00] <wikibugs>	 (PS8) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629)
[12:20:02] <wikibugs>	 (CR) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (2 comments) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[12:20:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P36790 and previous config saved to /var/cache/conftool/dbconfig/20221027-122007-marostegui.json
[12:20:26] <wikibugs>	 (PS3) Ladsgroup: lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703)
[12:20:30] <wikibugs>	 (CR) Ladsgroup: [C: +2] lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) (owner: Ladsgroup)
[12:20:41] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) (owner: Ladsgroup)
[12:21:25] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] [beta] Enable Kartographer show nearby clustering [mediawiki-config] - https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: WMDE-Fisch)
[12:22:45] <wikibugs>	 (CR) CI reject: [V: -1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[12:22:55] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (Ladsgroup) Open→Resolved a:Ladsgroup For longer term, I'd like to add a couple more cores to this poor tiny VM that has two cores...
[12:23:32] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:23:32] <wikibugs>	 (CR) Dzahn: [C: +1] "appledora changed the email in LDAP to the -ctr address. Let's get this merged" [puppet] - https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: Herron)
[12:23:49] <wikibugs>	 (PS7) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[12:24:27] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37803/console" [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[12:25:43] <wikibugs>	 (PS2) Dzahn: admin: add appledora to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: Herron)
[12:25:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318950)', diff saved to https://phabricator.wikimedia.org/P36791 and previous config saved to /var/cache/conftool/dbconfig/20221027-122557-ladsgroup.json
[12:26:04] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:26:28] <wikibugs>	 (CR) Dzahn: [C: +1] "amended to update email address" [puppet] - https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: Herron)
[12:27:23] <wikibugs>	 (CR) Dzahn: "nope. after adding the type back and using alias it's still back to "parameter 'gitlab_runner_hosts' expects an Array value, got String "" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[12:28:34] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, phab1004, releases1002, releases2002, relforge1003, relforge1004, wcqs2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_change
[12:28:42] <wikibugs>	 (PS1) Ladsgroup: maintenance: Use $this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485)
[12:28:59] <Amir1>	 jouncebot: nowandnext
[12:28:59] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 31 minute(s)
[12:29:00] <jouncebot>	 In 0 hour(s) and 31 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300)
[12:29:00] <jouncebot>	 In 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300)
[12:29:18] <wikibugs>	 (CR) Ladsgroup: [C: +2] maintenance: Use $this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup)
[12:29:28] <wikibugs>	 (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/850105 (owner: L10n-bot)
[12:30:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021', diff saved to https://phabricator.wikimedia.org/P36792 and previous config saved to /var/cache/conftool/dbconfig/20221027-123057-ladsgroup.json
[12:32:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P36793 and previous config saved to /var/cache/conftool/dbconfig/20221027-123224-ladsgroup.json
[12:35:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P36794 and previous config saved to /var/cache/conftool/dbconfig/20221027-123513-marostegui.json
[12:41:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P36795 and previous config saved to /var/cache/conftool/dbconfig/20221027-124104-ladsgroup.json
[12:42:02] <icinga-wm>	 RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:42:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36796 and previous config saved to /var/cache/conftool/dbconfig/20221027-124255-ladsgroup.json
[12:44:13] <wikibugs>	 (CR) CI reject: [V: -1] maintenance: Use $this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup)
[12:44:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:45:33] <wikibugs>	 (Merged) jenkins-bot: maintenance: Use $this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup)
[12:46:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021', diff saved to https://phabricator.wikimedia.org/P36797 and previous config saved to /var/cache/conftool/dbconfig/20221027-124603-ladsgroup.json
[12:47:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318950)', diff saved to https://phabricator.wikimedia.org/P36798 and previous config saved to /var/cache/conftool/dbconfig/20221027-124731-ladsgroup.json
[12:47:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance
[12:47:39] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:47:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance
[12:47:48] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]]
[12:47:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup)
[12:47:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36799 and previous config saved to /var/cache/conftool/dbconfig/20221027-124752-ladsgroup.json
[12:47:56] <stashbot>	 T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485
[12:48:09] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[12:48:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:49:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:49:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:49:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:50:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36800 and previous config saved to /var/cache/conftool/dbconfig/20221027-125002-ladsgroup.json
[12:50:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36801 and previous config saved to /var/cache/conftool/dbconfig/20221027-125020-marostegui.json
[12:50:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance
[12:50:26] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[12:50:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance
[12:50:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T321123)', diff saved to https://phabricator.wikimedia.org/P36802 and previous config saved to /var/cache/conftool/dbconfig/20221027-125042-marostegui.json
[12:52:16] <icinga-wm>	 PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:29] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]] (duration: 04m 40s)
[12:53:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321123)', diff saved to https://phabricator.wikimedia.org/P36803 and previous config saved to /var/cache/conftool/dbconfig/20221027-125307-marostegui.json
[12:54:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:54:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:54:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:54:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:54:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P36804 and previous config saved to /var/cache/conftool/dbconfig/20221027-125456-ladsgroup.json
[12:55:04] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:55:28] <wikibugs>	 (PS2) Vgutierrez: trafficserver: Clean up after ATS 9.x upgrade [puppet] - https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776)
[12:56:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P36805 and previous config saved to /var/cache/conftool/dbconfig/20221027-125610-ladsgroup.json
[12:58:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P36806 and previous config saved to /var/cache/conftool/dbconfig/20221027-125801-ladsgroup.json
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300).
[13:00:05] <jouncebot>	 Sohom_Datta, WMDE-Fisch, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <Lucas_WMDE>	 o/
[13:00:19] * urbanecm waves
[13:00:19] <WMDE-Fisch>	 o/
[13:00:28] <wikibugs>	 (PS1) Muehlenhoff: Switch idp.w.o to idp2002 [dns] - https://gerrit.wikimedia.org/r/850148
[13:00:32] <Sohom_Datta>	 o/
[13:00:34] <urbanecm>	 Lucas_WMDE: will you deploy, or should i?
[13:00:43] <Lucas_WMDE>	 I’m fine either way :)
[13:00:57] <urbanecm>	 tbh i prefer someone else deploying today, currently in the middle of something else
[13:01:01] <Lucas_WMDE>	 ok, I can do it
[13:01:02] <wikibugs>	 (CR) Muehlenhoff: [C: +2] debian: Add bookworm [puppet] - https://gerrit.wikimedia.org/r/850092 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[13:01:08] <urbanecm>	 thanks
[13:01:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36807 and previous config saved to /var/cache/conftool/dbconfig/20221027-130110-ladsgroup.json
[13:01:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance
[13:01:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance
[13:01:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36808 and previous config saved to /var/cache/conftool/dbconfig/20221027-130135-ladsgroup.json
[13:02:25] <wikibugs>	 (CR) JMeybohm: [C: +1] "🎉 thanks for doing this!" [deployment-charts] - https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: Clément Goubert)
[13:03:46] <vgutierrez>	 !log depool cp5007
[13:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:45] <wikibugs>	 (PS4) Lucas Werkmeister (WMDE): Enable source links on Translation ns on bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: Bodhisattwa)
[13:04:53] <koi>	 o/
[13:05:02] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: Bodhisattwa)
[13:05:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P36809 and previous config saved to /var/cache/conftool/dbconfig/20221027-130509-ladsgroup.json
[13:06:03] <wikibugs>	 (Merged) jenkins-bot: Enable source links on Translation ns on bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: Bodhisattwa)
[13:06:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:849672|Enable source links on Translation ns on bnwikisource (T53980)]]
[13:06:24] <stashbot>	 T53980: Source tab not showing up in the Translation namespace - https://phabricator.wikimedia.org/T53980
[13:06:38] <icinga-wm>	 RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and bodhisattwa: Backport for [[gerrit:849672|Enable source links on Translation ns on bnwikisource (T53980)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:06:39] <wikibugs>	 (CR) Muehlenhoff: check_systemd_state: consume exclusion list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[13:07:23] <Lucas_WMDE>	 Sohom_Datta: can you test the change on mwdebug?
[13:07:55] <Sohom_Datta>	 Yep, I can see the changes on mwdebug, works fine :)
[13:08:00] <Lucas_WMDE>	 yay
[13:08:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36810 and previous config saved to /var/cache/conftool/dbconfig/20221027-130814-marostegui.json
[13:08:16] <icinga-wm>	 RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[13:09:08] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: -1] Enable show nearby feature on de.wikivoyage (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[13:09:56] <WMDE-Fisch>	 Whoops
[13:10:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:10:20] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5007 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[13:10:32] <icinga-wm>	 RECOVERY - Check systemd state on idp-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:11:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:11:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:11:16] <icinga-wm>	 PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:11:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318950)', diff saved to https://phabricator.wikimedia.org/P36811 and previous config saved to /var/cache/conftool/dbconfig/20221027-131117-ladsgroup.json
[13:11:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:11:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:11:23] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[13:11:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T318950)', diff saved to https://phabricator.wikimedia.org/P36812 and previous config saved to /var/cache/conftool/dbconfig/20221027-131127-ladsgroup.json
[13:11:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:11:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:849672|Enable source links on Translation ns on bnwikisource (T53980)]] (duration: 05m 40s)
[13:12:07] <stashbot>	 T53980: Source tab not showing up in the Translation namespace - https://phabricator.wikimedia.org/T53980
[13:12:09] <vgutierrez>	 !log pool cp5007
[13:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:24] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5007 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[13:13:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P36813 and previous config saved to /var/cache/conftool/dbconfig/20221027-131308-ladsgroup.json
[13:13:16] <wikibugs>	 (PS4) WMDE-Fisch: Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692)
[13:14:06] <Lucas_WMDE>	 skipping WMDE-Fisch for a second and continuing with koi 
[13:14:10] <Lucas_WMDE>	 oh, nevermind
[13:14:11] <wikibugs>	 (CR) David Caro: [C: +1] "I actually use codesearch :)" [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez)
[13:14:15] <Lucas_WMDE>	 :D
[13:14:16] <WMDE-Fisch>	 Lucas_WMDE: Updated
[13:14:19] <WMDE-Fisch>	 :-)
[13:14:45] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Enable show nearby feature on de.wikivoyage (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[13:14:50] <wikibugs>	 (PS5) Lucas Werkmeister (WMDE): Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[13:14:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[13:15:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318950)', diff saved to https://phabricator.wikimedia.org/P36814 and previous config saved to /var/cache/conftool/dbconfig/20221027-131524-ladsgroup.json
[13:16:16] <wikibugs>	 (Merged) jenkins-bot: Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch)
[13:16:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:842353|Enable show nearby feature on de.wikivoyage (T320692)]]
[13:16:36] <stashbot>	 T320692: Disable Wikivoyage nearby and enable Show Nearby on de.wikivoyage - https://phabricator.wikimedia.org/T320692
[13:16:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and wmde-fisch: Backport for [[gerrit:842353|Enable show nearby feature on de.wikivoyage (T320692)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:17:51] <Lucas_WMDE>	 WMDE-Fisch: can you check that it’s working on mwdebug?
[13:18:04] <WMDE-Fisch>	 Lucas_WMDE: Tested on mwdebug, works fine. Please go on :-)
[13:18:09] <Lucas_WMDE>	 yay
[13:18:33] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "Matches the default in PageTriage’s extension.json." [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[13:19:43] <koi>	 Hi Lucas_WMDE, TheresNoTime said this patch is not really testable
[13:19:52] <Lucas_WMDE>	 yeah, makes sense
[13:19:58] <koi>	 and just need to " i.e. check https://grafana-rw.wikimedia.org/d/GDZR_4IVz/pagetriage-debugging?orgId=1&from=now-7d&to=now&refresh=1m and make sure the NOINDEX graph doesn't dramatically drop/rise"
[13:20:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P36815 and previous config saved to /var/cache/conftool/dbconfig/20221027-132016-ladsgroup.json
[13:20:58] <Lucas_WMDE>	 I assume that’s something that would happen over the course of days rather than minutes
[13:21:01] <wikibugs>	 SRE, Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (CDanis)
[13:21:02] <Lucas_WMDE>	 (as articles slowly get re-parsed)
[13:21:09] <wikibugs>	 (CR) Vgutierrez: "https://puppet-compiler.wmflabs.org/pcc-worker1003/37804/ errors for deployment-cache-text07 are expected. We need to update the hieradata" [puppet] - https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) (owner: Vgutierrez)
[13:21:14] <wikibugs>	 (PS2) Muehlenhoff: wmcs::nfs: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013)
[13:21:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36816 and previous config saved to /var/cache/conftool/dbconfig/20221027-132148-ladsgroup.json
[13:22:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:22:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:842353|Enable show nearby feature on de.wikivoyage (T320692)]] (duration: 05m 35s)
[13:22:15] <stashbot>	 T320692: Disable Wikivoyage nearby and enable Show Nearby on de.wikivoyage - https://phabricator.wikimedia.org/T320692
[13:22:32] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Define a default value for wgPageTriageMaxAge [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[13:22:40] <TheresNoTime>	 (good point ref. cache/re-parse Lucas_WMDE)
[13:22:43] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[13:23:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:23:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:23:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36817 and previous config saved to /var/cache/conftool/dbconfig/20221027-132320-marostegui.json
[13:23:36] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:38] <wikibugs>	 (Merged) jenkins-bot: Define a default value for wgPageTriageMaxAge [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang)
[13:23:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]]
[13:23:57] <stashbot>	 T310974: Extend PageTriageMaxAge (noindex) for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974
[13:23:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:24:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:24:30] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:24:52] <Lucas_WMDE>	 I purged two articles on mwdebug and they didn’t get noindex
[13:25:07] <Lucas_WMDE>	 I don’t know PageTriage enough to find pages that should have (and keep) noindex
[13:25:23] <TheresNoTime>	 (all should stay the same with that config variable)
[13:25:26] <Lucas_WMDE>	 I’ll just continue with the sync
[13:25:32] <TheresNoTime>	 ack, sounds good
[13:26:37] <Lucas_WMDE>	 tbh I’m not sure this change is actually needed – maybe it would’ve been enough to use 'default' => null, 'enwiki' => 0 in the other change?
[13:26:49] <Lucas_WMDE>	 (null should mean that the setting isn’t added at all on most wikis, leaving the extension default in place)
[13:26:59] <Lucas_WMDE>	 but maybe it’s better to have the default explicit
[13:27:02] <wikibugs>	 (CR) Jbond: doc: add parameters for gitlab_runner and contint hosts, allow them (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[13:27:06] <wikibugs>	 (PS9) Jbond: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[13:27:10] <wikibugs>	 (PS1) Jbond: O:docker_registry_ha::registry: move defaults to common section [puppet] - https://gerrit.wikimedia.org/r/850153
[13:27:31] <TheresNoTime>	 Lucas_WMDE: hm, did think of that, but I personally wanted to suggest that we explicitly set it/make it known
[13:27:40] <Lucas_WMDE>	 ok :)
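Editor's note: the exchange above weighs `'default' => null, 'enwiki' => 0` against an explicit default. The Python sketch below only illustrates the behaviour Lucas_WMDE describes (a null default leaves the extension's own default in place). It is not the actual mediawiki-config code; `resolve_setting` and the value 90 are hypothetical placeholders, not taken from the log.

```python
# Hypothetical illustration only: not the real wmf-config / MediaWiki logic.
EXTENSION_DEFAULT_MAX_AGE = 90  # assumed placeholder, not a value from the log


def resolve_setting(per_wiki, wiki, extension_default):
    """Return the effective value of one setting for one wiki."""
    # A wiki-specific entry wins; otherwise fall back to the 'default' key.
    value = per_wiki[wiki] if wiki in per_wiki else per_wiki.get("default")
    # None (PHP null) means "do not set it", so the extension default applies.
    # Note that 0 is a real value and must not be treated as "unset".
    return extension_default if value is None else value


# Variant discussed in chat: implicit default via null.
implicit = {"default": None, "enwiki": 0}
# Variant that was deployed in spirit: the default is spelled out.
explicit = {"default": EXTENSION_DEFAULT_MAX_AGE, "enwiki": 0}

for wiki in ("enwiki", "dewiki"):
    print(
        wiki,
        resolve_setting(implicit, wiki, EXTENSION_DEFAULT_MAX_AGE),
        resolve_setting(explicit, wiki, EXTENSION_DEFAULT_MAX_AGE),
    )
```

Under these assumptions both variants resolve to the same effective values; the explicit form simply makes the default visible in the config, which is the point TheresNoTime raises.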
[13:27:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P36818 and previous config saved to /var/cache/conftool/dbconfig/20221027-132743-ladsgroup.json
[13:27:49] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37805/console" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[13:27:53] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[13:28:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36819 and previous config saved to /var/cache/conftool/dbconfig/20221027-132814-ladsgroup.json
[13:28:55] <wikibugs>	 (CR) Jbond: [C: +1] doc: add parameters for gitlab_runner and contint hosts, allow them (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[13:29:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:29:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]] (duration: 05m 33s)
[13:29:30] <stashbot>	 T310974: Extend PageTriageMaxAge (noindex) for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974
[13:29:46] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:49] <Lucas_WMDE>	 anything else to deploy?
[13:30:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:30:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:30:02] <Lucas_WMDE>	 (I see the “real” enwiki PageTriage change is scheduled for next Monday)
[13:30:18] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37806/console" [puppet] - https://gerrit.wikimedia.org/r/850153 (owner: Jbond)
[13:30:21] <WMDE-Fisch>	 I have a labs patch but that just needs to be merged and then a git fetch to be nice I guess.
[13:30:25] <WMDE-Fisch>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/850118
[13:30:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Cmjohnson) The mgmt links are still not working, The DNS is correct but I am unable to ping the servers.
[13:30:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P36820 and previous config saved to /var/cache/conftool/dbconfig/20221027-133031-ladsgroup.json
[13:30:51] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): [beta] Enable Kartographer show nearby clustering [mediawiki-config] - https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: WMDE-Fisch)
[13:30:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:30:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: WMDE-Fisch)
[13:31:13] <WMDE-Fisch>	 ty ;-)
[13:31:13] <Lucas_WMDE>	 I put it into scap backport, IIUC it’ll decide to skip the sync on its own
[13:31:29] <wikibugs>	 (CR) jenkins-bot: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[13:31:43] <wikibugs>	 (Merged) jenkins-bot: [beta] Enable Kartographer show nearby clustering [mediawiki-config] - https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: WMDE-Fisch)
[13:32:16] <Lucas_WMDE>	 yup, it’s done already
[13:32:34] <Lucas_WMDE>	 it didn’t even log anything
[13:32:58] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:04] <wikibugs>	 (PS1) Filippo Giunchedi: smokeping: add ensure parameter, set to present [puppet] - https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860)
[13:33:06] <wikibugs>	 (PS1) Filippo Giunchedi: profile: absent smokeping [puppet] - https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860)
[13:33:08] <wikibugs>	 (PS1) Filippo Giunchedi: smokeping: remove module and profile [puppet] - https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860)
[13:33:10] <wikibugs>	 (PS1) Filippo Giunchedi: smokeping: remove ancillary data [puppet] - https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860)
[13:33:38] <wikibugs>	 (CR) CI reject: [V: -1] smokeping: add ensure parameter, set to present [puppet] - https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[13:35:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36821 and previous config saved to /var/cache/conftool/dbconfig/20221027-133522-ladsgroup.json
[13:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:35:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:35:29] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[13:35:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance
[13:35:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance
[13:35:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T318950)', diff saved to https://phabricator.wikimedia.org/P36822 and previous config saved to /var/cache/conftool/dbconfig/20221027-133551-ladsgroup.json
[13:36:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:36:21] <wikibugs>	 (PS1) Ssingh: cp4042: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850158 (https://phabricator.wikimedia.org/T317244)
[13:36:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P36823 and previous config saved to /var/cache/conftool/dbconfig/20221027-133654-ladsgroup.json
[13:36:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:36:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:37:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:38:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318950)', diff saved to https://phabricator.wikimedia.org/P36824 and previous config saved to /var/cache/conftool/dbconfig/20221027-133801-ladsgroup.json
[13:38:26] <wikibugs>	 (CR) Ssingh: [C: +2] cp4042: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850158 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[13:38:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321123)', diff saved to https://phabricator.wikimedia.org/P36825 and previous config saved to /var/cache/conftool/dbconfig/20221027-133827-marostegui.json
[13:38:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance
[13:38:32] <wikibugs>	 (CR) Btullis: [C: +1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/842763 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[13:38:33] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[13:38:42] <wikibugs>	 (PS2) Filippo Giunchedi: smokeping: add ensure parameter, set to present [puppet] - https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860)
[13:38:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance
[13:38:44] <wikibugs>	 (PS2) Filippo Giunchedi: profile: absent smokeping [puppet] - https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860)
[13:38:46] <wikibugs>	 (PS2) Filippo Giunchedi: smokeping: remove module and profile [puppet] - https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860)
[13:38:48] <wikibugs>	 (PS2) Filippo Giunchedi: smokeping: remove ancillary data [puppet] - https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860)
[13:38:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36826 and previous config saved to /var/cache/conftool/dbconfig/20221027-133848-marostegui.json
[13:39:05] <wikibugs>	 (CR) Btullis: [C: +1] "Thanks." [puppet] - https://gerrit.wikimedia.org/r/832258 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[13:39:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] wmcs::nfs: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[13:39:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS buster
[13:39:52] <wikibugs>	 (CR) Btullis: [C: +1] matomo/piwik: Enable profile::auto_restarts::service for Apache [puppet] - https://gerrit.wikimedia.org/r/832483 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[13:40:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[13:40:56] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36827 and previous config saved to /var/cache/conftool/dbconfig/20221027-134115-marostegui.json
[13:42:38] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (fgiunchedi) >>! In T303253#8348894, @jcrespo wrote: >> we want to move "systemd unit failed" off Icinga and onto AM too >  > This is higher level, a...
[13:42:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36828 and previous config saved to /var/cache/conftool/dbconfig/20221027-134251-ladsgroup.json
[13:43:00] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] matomo/piwik: Enable profile::auto_restarts::service for Apache [puppet] - https://gerrit.wikimedia.org/r/832483 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[13:44:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:45:04] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[13:45:06] <wikibugs>	 (CR) Elukey: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (3 comments) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[13:45:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P36829 and previous config saved to /var/cache/conftool/dbconfig/20221027-134537-ladsgroup.json
[13:45:49] <wikibugs>	 (CR) Awight: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight)
[13:46:13] <wikibugs>	 (CR) Clément Goubert: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: Clément Goubert)
[13:46:13] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:46:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch idp.w.o to idp2002 [dns] - https://gerrit.wikimedia.org/r/850148 (owner: Muehlenhoff)
[13:50:41] <wikibugs>	 (CR) WMDE-Fisch: [C: +1] Invite some of WMDE Tech Wishes team to poke around maps instances [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight)
[13:52:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P36830 and previous config saved to /var/cache/conftool/dbconfig/20221027-135201-ladsgroup.json
[13:52:16] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:53:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P36831 and previous config saved to /var/cache/conftool/dbconfig/20221027-135307-ladsgroup.json
[13:53:40] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:35] <wikibugs>	 (CR) Herron: [C: +2] "Thanks for the review and update!" [puppet] - https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: Herron)
[13:55:44] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P36832 and previous config saved to /var/cache/conftool/dbconfig/20221027-135621-marostegui.json
[13:57:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36833 and previous config saved to /var/cache/conftool/dbconfig/20221027-135757-ladsgroup.json
[13:58:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:59:55] <wikibugs>	 (CR) Clément Goubert: [C: +2] Adapts specs and tests to kubeconform only [deployment-charts] - https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: Clément Goubert)
[14:00:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318950)', diff saved to https://phabricator.wikimedia.org/P36834 and previous config saved to /var/cache/conftool/dbconfig/20221027-140043-ladsgroup.json
[14:00:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:00:50] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[14:00:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:01:02] <wikibugs>	 (PS8) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[14:01:04] <wikibugs>	 (CR) Muehlenhoff: [C: +2] statistics : Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/842763 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:01:22] <wikibugs>	 (CR) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[14:01:41] <wikibugs>	 (PS1) Vgutierrez: Add enterprisewikimedia.com as a ncredir domain [dns] - https://gerrit.wikimedia.org/r/850167 (https://phabricator.wikimedia.org/T321804)
[14:02:00] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37807/console" [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[14:02:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for nginx on archiva/proxy [puppet] - https://gerrit.wikimedia.org/r/832258 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:02:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (herron) Open→Resolved a:herron The requested access has been granted and will fully propagate within 30 minutes. Transitioning this to r...
[14:02:55] <wikibugs>	 (PS2) Vgutierrez: Add wikimediaenteprise.com as a ncredir domain [dns] - https://gerrit.wikimedia.org/r/850167 (https://phabricator.wikimedia.org/T321804)
[14:03:42] <wikibugs>	 (Merged) jenkins-bot: Adapts specs and tests to kubeconform only [deployment-charts] - https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: Clément Goubert)
[14:03:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:04:31] <wikibugs>	 (CR) Clément Goubert: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[14:05:02] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:05:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage
[14:07:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36835 and previous config saved to /var/cache/conftool/dbconfig/20221027-140708-ladsgroup.json
[14:08:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P36836 and previous config saved to /var/cache/conftool/dbconfig/20221027-140814-ladsgroup.json
[14:08:38] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage
[14:08:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:10:30] <wikibugs>	 (PS2) Muehlenhoff: dumps::generation: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/842760 (https://phabricator.wikimedia.org/T308013)
[14:11:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P36837 and previous config saved to /var/cache/conftool/dbconfig/20221027-141128-marostegui.json
[14:13:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P36838 and previous config saved to /var/cache/conftool/dbconfig/20221027-141304-ladsgroup.json
[14:13:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:13:10] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:13:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:13:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36839 and previous config saved to /var/cache/conftool/dbconfig/20221027-141326-ladsgroup.json
[14:13:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:14:53] <wikibugs>	 (CR) Muehlenhoff: [C: +2] dumps::generation: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/842760 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:16:04] <wikibugs>	 (PS2) Muehlenhoff: kafka: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013)
[14:20:41] <wikibugs>	 (PS9) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704)
[14:22:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] kafka: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:23:02] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez)
[14:23:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318950)', diff saved to https://phabricator.wikimedia.org/P36840 and previous config saved to /var/cache/conftool/dbconfig/20221027-142320-ladsgroup.json
[14:23:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance
[14:23:27] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[14:23:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance
[14:23:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T318950)', diff saved to https://phabricator.wikimedia.org/P36841 and previous config saved to /var/cache/conftool/dbconfig/20221027-142342-ladsgroup.json
[14:24:10] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:16] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Daimona)
[14:24:58] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2001-dev.codfw.wmnet with OS bullseye
[14:25:33] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2003-dev.codfw.wmnet with OS bullseye
[14:25:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318950)', diff saved to https://phabricator.wikimedia.org/P36842 and previous config saved to /var/cache/conftool/dbconfig/20221027-142552-ladsgroup.json
[14:26:27] <wikibugs>	 (PS1) Muehlenhoff: Add Daimona to CONTRIBUTORS [puppet] - https://gerrit.wikimedia.org/r/850170
[14:26:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36843 and previous config saved to /var/cache/conftool/dbconfig/20221027-142634-marostegui.json
[14:26:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance
[14:26:41] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[14:26:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance
[14:26:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T321123)', diff saved to https://phabricator.wikimedia.org/P36844 and previous config saved to /var/cache/conftool/dbconfig/20221027-142656-marostegui.json
[14:27:56] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Daimona to CONTRIBUTORS [puppet] - https://gerrit.wikimedia.org/r/850170 (owner: Muehlenhoff)
[14:28:46] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321123)', diff saved to https://phabricator.wikimedia.org/P36845 and previous config saved to /var/cache/conftool/dbconfig/20221027-143045-marostegui.json
[14:30:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:30:53] <wikibugs>	 (PS11) Filippo Giunchedi: dispatch: introduce profile [puppet] - https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229)
[14:31:50] <wikibugs>	 (PS1) Jbond: R:rsync::manifests::server::module: add type validation [puppet] - https://gerrit.wikimedia.org/r/850171
[14:31:52] <wikibugs>	 (PS1) Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - https://gerrit.wikimedia.org/r/850172
[14:31:54] <wikibugs>	 (PS1) Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - https://gerrit.wikimedia.org/r/850173
[14:32:50] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (Isaac) Thanks all as always for the quick and helpful support in granting access!
[14:33:58] <wikibugs>	 (CR) CI reject: [V: -1] R:rsync::manifests::server::module: add type validation [puppet] - https://gerrit.wikimedia.org/r/850171 (owner: Jbond)
[14:34:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:34:12] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS buster
[14:34:42] <wikibugs>	 (CR) CI reject: [V: -1] R:rsync::manifests::server::module: Strengthen types [puppet] - https://gerrit.wikimedia.org/r/850172 (owner: Jbond)
[14:35:41] <wikibugs>	 (CR) CI reject: [V: -1] rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - https://gerrit.wikimedia.org/r/850173 (owner: Jbond)
[14:35:45] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:38:11] <wikibugs>	 (PS9) Klausman: wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389)
[14:39:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:39:27] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:39:32] <wikibugs>	 (PS5) Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229)
[14:40:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P36846 and previous config saved to /var/cache/conftool/dbconfig/20221027-144058-ladsgroup.json
[14:41:20] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage
[14:41:43] <wikibugs>	 (CR) Ahmon Dancy: "Adding Jelto to make sure he's aware of what's going on in this area." [puppet] - https://gerrit.wikimedia.org/r/850153 (owner: Jbond)
[14:43:25] <wikibugs>	 (PS1) Ssingh: cp4041: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850176 (https://phabricator.wikimedia.org/T317244)
[14:45:04] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage
[14:45:13] <wikibugs>	 (CR) Ssingh: [C: +2] cp4041: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850176 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[14:45:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36847 and previous config saved to /var/cache/conftool/dbconfig/20221027-144551-marostegui.json
[14:46:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36848 and previous config saved to /var/cache/conftool/dbconfig/20221027-144602-ladsgroup.json
[14:46:08] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:47:08] <wikibugs>	 (CR) Sergio Gimeno: [C: +1] [labs] GrowthExperiments: Use d3.js with new impact module [mediawiki-config] - https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854) (owner: Kosta Harlan)
[14:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[14:48:05] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:42] <moritzm>	 !log installing twitter-bootstrap4 security updates
[14:48:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS buster
[14:50:08] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage
[14:50:11] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage
[14:51:17] <moritzm>	 !log installing krb5 bugfix updates from Bullseye point release
[14:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:39] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:54] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: add dispatch db_hostname [puppet] - https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229)
[14:55:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] hieradata: add dispatch db_hostname [puppet] - https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:55:15] <wikibugs>	 (PS2) Filippo Giunchedi: hieradata: add dispatch db_hostname [puppet] - https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229)
[14:55:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] kubernetes: Rename mwdebug to mw-debug [puppet] - https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[14:55:44] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2] hieradata: add dispatch db_hostname [puppet] - https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:56:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P36849 and previous config saved to /var/cache/conftool/dbconfig/20221027-145604-ladsgroup.json
[14:57:37] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36850 and previous config saved to /var/cache/conftool/dbconfig/20221027-150058-marostegui.json
[15:01:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36851 and previous config saved to /var/cache/conftool/dbconfig/20221027-150108-ladsgroup.json
[15:03:57] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:03:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:04:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mediawiki: Create new mw-debug deployment [deployment-charts] - https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[15:05:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:05:41] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:06:29] <claime>	 jouncebot: nowandnext
[15:06:29] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 53 minute(s)
[15:06:29] <jouncebot>	 In 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1600)
[15:06:35] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[15:07:06] <claime>	 !log Switching k8s-experimental mwdebug service
[15:07:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1056 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:16] <moritzm>	 !log installing node-moment security updates
[15:07:18] <claime>	 !log Pausing mwdebug k8s deployments
[15:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:38] <wikibugs>	 (CR) Clément Goubert: [C: +2] mediawiki: Create new mw-debug deployment [deployment-charts] - https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[15:07:44] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:04] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:09:18] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=citoid.svc.eqiad.wmnet, port=4003): Read timed out. (read timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Citoid
[15:10:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:11:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318950)', diff saved to https://phabricator.wikimedia.org/P36852 and previous config saved to /var/cache/conftool/dbconfig/20221027-151111-ladsgroup.json
[15:11:12] <icinga-wm>	 RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:11:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance
[15:11:18] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[15:11:19] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2001-dev.codfw.wmnet with OS bullseye
[15:11:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance
[15:11:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T318950)', diff saved to https://phabricator.wikimedia.org/P36853 and previous config saved to /var/cache/conftool/dbconfig/20221027-151133-ladsgroup.json
[15:11:54] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Create new mw-debug deployment [deployment-charts] - https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[15:12:32] <claime>	 !log Silence ProbeDown instance="mwdebug:4444" for 1h
[15:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:36] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[15:13:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318950)', diff saved to https://phabricator.wikimedia.org/P36854 and previous config saved to /var/cache/conftool/dbconfig/20221027-151343-ladsgroup.json
[15:13:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:14:36] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1056 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:15:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage
[15:16:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321123)', diff saved to https://phabricator.wikimedia.org/P36855 and previous config saved to /var/cache/conftool/dbconfig/20221027-151604-marostegui.json
[15:16:10] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[15:16:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36856 and previous config saved to /var/cache/conftool/dbconfig/20221027-151615-ladsgroup.json
[15:17:03] <wikibugs>	 (PS2) Jbond: R:rsync::manifests::server::module: add type validation [puppet] - https://gerrit.wikimedia.org/r/850171
[15:17:17] <wikibugs>	 (PS3) Jbond: R:rsync::manifests::server::module: add type validation [puppet] - https://gerrit.wikimedia.org/r/850171
[15:18:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:18:39] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:18:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[15:19:08] <wikibugs>	 (CR) Clément Goubert: [C: +2] kubernetes: Rename mwdebug to mw-debug [puppet] - https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: Clément Goubert)
[15:19:12] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[15:19:30] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage
[15:21:59] <wikibugs>	 SRE, Data-Engineering-Operations, Data-Engineering-Planning, Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (BTullis) After modifying the alias I also needed to set the following option in mailman. {F35641648,width=80%}
[15:22:01] <claime>	 !log Unpausing mwdebug k8s deployments
[15:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:32] <wikibugs>	 (PS7) Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129)
[15:22:36] <wikibugs>	 SRE, Data-Engineering-Operations, Data-Engineering-Planning, Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (BTullis) Open→Resolved
[15:23:30] <claime>	 !log k8s-experimental mwdebug service switched to new deployment mw-debug
[15:23:31] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:42] <claime>	 !log Removed silence ProbeDown instance="mwdebug:4444"
[15:23:47] <wikibugs>	 (CR) CI reject: [V: -1] Declare mediawiki.page_change stream in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) (owner: Ottomata)
[15:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: data reload
[15:26:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: data reload
[15:26:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on wcqs2003.codfw.wmnet with reason: data reload
[15:26:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wcqs2003.codfw.wmnet with reason: data reload
[15:26:34] <wikibugs>	 (PS1) Clément Goubert: mw-debug: Remove old mwdebug deployment [deployment-charts] - https://gerrit.wikimedia.org/r/850184 (https://phabricator.wikimedia.org/T321201)
[15:26:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wcqs2002.codfw.wmnet
[15:26:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wcqs2002.codfw.wmnet
[15:28:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P36857 and previous config saved to /var/cache/conftool/dbconfig/20221027-152849-ladsgroup.json
[15:31:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36858 and previous config saved to /var/cache/conftool/dbconfig/20221027-153121-ladsgroup.json
[15:31:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:31:26] <wikibugs>	 (PS1) Btullis: Add a postgres user with our IPv6 network address [puppet] - https://gerrit.wikimedia.org/r/850185 (https://phabricator.wikimedia.org/T319440)
[15:31:28] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[15:31:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:31:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36859 and previous config saved to /var/cache/conftool/dbconfig/20221027-153143-ladsgroup.json
[15:32:36] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37808/console" [puppet] - https://gerrit.wikimedia.org/r/850185 (https://phabricator.wikimedia.org/T319440) (owner: Btullis)
[15:32:51] <wikibugs>	 (PS1) Ssingh: cp4050: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850186 (https://phabricator.wikimedia.org/T317244)
[15:34:03] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Add a postgres user with our IPv6 network address [puppet] - https://gerrit.wikimedia.org/r/850185 (https://phabricator.wikimedia.org/T319440) (owner: Btullis)
[15:34:14] <icinga-wm>	 PROBLEM - Disk space on alert1001 is CRITICAL: DISK CRITICAL - /run/docker/netns/default is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=alert1001&var-datasource=eqiad+prometheus/ops
[15:34:44] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:27] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[15:40:36] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm minor optional nit" [puppet] - https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: Giuseppe Lavagetto)
[15:42:28] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4041.ulsfo.wmnet with OS buster
[15:43:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P36860 and previous config saved to /var/cache/conftool/dbconfig/20221027-154356-ladsgroup.json
[15:44:11] <wikibugs>	 (CR) Ssingh: [C: +2] cp4050: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850186 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[15:44:23] <wikibugs>	 (CR) Elukey: [C: +1] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[15:45:24] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1056 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:45:27] <wikibugs>	 (CR) Klausman: [C: +2] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[15:45:28] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS buster
[15:46:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[15:47:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[15:53:06] <wikibugs>	 (CR) David Caro: [V: +1] p::toolforge:harbor: use distro docker for bullseye (2 comments) [puppet] - https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: David Caro)
[15:53:31] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: don't configure anything on base dataplace interface [puppet] - https://gerrit.wikimedia.org/r/850190 (https://phabricator.wikimedia.org/T319184)
[15:53:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:54:56] <wikibugs>	 (CR) David Caro: [C: +2] p::ceph:mon: set permissions if mgr key parent dirs [puppet] - https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: David Caro)
[15:55:31] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS bullseye
[15:55:35] <wikibugs>	 (PS1) Klausman: wikilables: fix wrong path for Postgres tuning.conf [puppet] - https://gerrit.wikimedia.org/r/850191
[15:56:28] <wikibugs>	 (CR) Andrew Bogott: [C: +1] cloudgw: don't configure anything on base dataplace interface [puppet] - https://gerrit.wikimedia.org/r/850190 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[15:56:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: don't configure anything on base dataplace interface [puppet] - https://gerrit.wikimedia.org/r/850190 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[15:57:01] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37809/console" [puppet] - https://gerrit.wikimedia.org/r/850191 (owner: Klausman)
[15:57:51] <wikibugs>	 (CR) Klausman: wikilables: fix wrong path for Postgres tuning.conf [puppet] - https://gerrit.wikimedia.org/r/850191 (owner: Klausman)
[15:58:20] <wikibugs>	 (PS2) Klausman: wikilabels: fix wrong path for Postgres tuning.conf [puppet] - https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389)
[15:59:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318950)', diff saved to https://phabricator.wikimedia.org/P36861 and previous config saved to /var/cache/conftool/dbconfig/20221027-155902-ladsgroup.json
[15:59:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance
[15:59:09] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[15:59:39] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:59:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance
[15:59:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36862 and previous config saved to /var/cache/conftool/dbconfig/20221027-155946-ladsgroup.json
[15:59:55] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[16:00:04] <jouncebot>	 jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1600).
[16:00:04] <jouncebot>	 zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[16:00:24] <zabe>	 hey
[16:00:40] <zabe>	 o/
[16:00:53] <wikibugs>	 SRE, FR-MW-Vagrant, Fundraising-Backlog, MediaWiki-Vagrant: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (Tgr) Open→Invalid After {T271649} and the switch to PHP 7.4, Vagrant now uses XDebug 3.
[16:00:55] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:00:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36863 and previous config saved to /var/cache/conftool/dbconfig/20221027-160056-ladsgroup.json
[16:01:09] <wikibugs>	 SRE, Infrastructure-Foundations, Packaging: Update php-xdebug to 2.7.2 in apt.wikimedia.org - https://phabricator.wikimedia.org/T263933 (Tgr) Open→Invalid After {T271649} and the switch to PHP 7.4, Vagrant now uses XDebug 3.
[16:01:59] <wikibugs>	 (PS1) Ssingh: cp4051: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850192 (https://phabricator.wikimedia.org/T317244)
[16:02:01] <wikibugs>	 (PS1) Ssingh: cp4052: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850193 (https://phabricator.wikimedia.org/T317244)
[16:02:40] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37810/console" [puppet] - https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[16:03:01] <wikibugs>	 (PS2) Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - https://gerrit.wikimedia.org/r/850172
[16:03:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:04:21] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2064 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:05:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36864 and previous config saved to /var/cache/conftool/dbconfig/20221027-160511-ladsgroup.json
[16:05:18] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[16:07:34] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ssingh)
[16:08:01] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ssingh)
[16:08:18] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: codfw1dev: don't hardcode interface names [puppet] - https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184)
[16:08:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:11:05] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2064 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:11:15] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/850171 (owner: Jbond)
[16:11:19] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage
[16:13:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:14:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet
[16:15:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage
[16:15:15] <wikibugs>	 SRE, MW-on-K8s, serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Krinkle)
[16:16:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P36865 and previous config saved to /var/cache/conftool/dbconfig/20221027-161602-ladsgroup.json
[16:16:34] <wikibugs>	 (CR) Elukey: [C: +1] wikilabels: fix wrong path for Postgres tuning.conf [puppet] - https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[16:18:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:18:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet
[16:20:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36866 and previous config saved to /var/cache/conftool/dbconfig/20221027-162018-ladsgroup.json
[16:20:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet
[16:21:04] <wikibugs>	 (CR) Andrew Bogott: [C: +1] cloudgw: codfw1dev: don't hardcode interface names [puppet] - https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[16:21:14] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/37811/" [puppet] - https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[16:21:27] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudgw: codfw1dev: don't hardcode interface names [puppet] - https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[16:23:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:23:38] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] Disable wgParserEnableLegacyMediaDOM on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[16:23:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:24:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet
[16:25:10] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:21] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing mw-debug
[16:27:55] <zabe>	 jbond, rzl: any of you around?
[16:28:43] <jbond>	 zabe: here sorry i missed the ping earlier let me take a look
[16:28:46] <wikibugs>	 SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (aborrero)
[16:29:01] <zabe>	 no worries
[16:29:11] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[16:31:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P36867 and previous config saved to /var/cache/conftool/dbconfig/20221027-163109-ladsgroup.json
[16:33:19] <logmsgbot>	 !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 GIT_BASE=https://gerrit.wikimedia.org/r/ MW_CONFIG_BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver MV_BASE_PACKAGES= MV_EXTRA_CA_CERT=' returned non-zero exit status 2. (duration: 05m 58s)
[16:34:33] <jbond>	 zabe: I'm just going to ping in serviceops to get someone else to approve 843001, as I'm not too familiar with that
[16:34:56] <zabe>	 sure
[16:35:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36868 and previous config saved to /var/cache/conftool/dbconfig/20221027-163524-ladsgroup.json
[16:39:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4050.ulsfo.wmnet with OS buster
[16:41:59] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:43:13] <wikibugs>	 (CR) Ssingh: [C: +2] cp4051: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850192 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[16:45:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS buster
[16:46:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36869 and previous config saved to /var/cache/conftool/dbconfig/20221027-164615-ladsgroup.json
[16:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[16:46:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[16:46:22] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[16:46:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36870 and previous config saved to /var/cache/conftool/dbconfig/20221027-164626-ladsgroup.json
[16:47:31] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing mw-debug
[16:47:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36871 and previous config saved to /var/cache/conftool/dbconfig/20221027-164735-ladsgroup.json
[16:47:54] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] wikilabels: fix wrong path for Postgres tuning.conf [puppet] - https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[16:48:23] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[16:48:23] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[16:50:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:50:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36872 and previous config saved to /var/cache/conftool/dbconfig/20221027-165031-ladsgroup.json
[16:50:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:50:37] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[16:50:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:50:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P36873 and previous config saved to /var/cache/conftool/dbconfig/20221027-165052-ladsgroup.json
[16:51:48] <wikibugs>	 (CR) BBlack: varnish: Conditionally set WMF-Last-Access cookie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[16:52:24] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[16:52:25] <wikibugs>	 (CR) BCornwall: varnish: Conditionally set WMF-Last-Access cookie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[16:52:38] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[16:52:56] <logmsgbot>	 !log dancy@deploy1002 dancy: testing mw-debug synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[16:52:59] <logmsgbot>	 !log dancy@deploy1002 Sync cancelled.
[16:53:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:54:03] <wikibugs>	 (CR) Andrew Bogott: "epic pcc run in progress: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37812/console" [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro)
[16:55:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[16:55:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[16:55:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[16:55:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[16:56:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P36874 and previous config saved to /var/cache/conftool/dbconfig/20221027-165659-ladsgroup.json
[16:57:05] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[16:58:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:00:04] <jouncebot>	 bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1700).
[17:01:39] <bd808>	 Nothing for me to deploy today.
[17:02:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P36875 and previous config saved to /var/cache/conftool/dbconfig/20221027-170242-ladsgroup.json
[17:03:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:11:04] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage
[17:12:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36876 and previous config saved to /var/cache/conftool/dbconfig/20221027-171205-ladsgroup.json
[17:12:57] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH) cp4040 and cp4048 had the DAC cable clicked in on the NIC, but not pressed in quite all the way.  Reseated and the link lights came up immediately.
[17:14:39] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage
[17:17:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P36877 and previous config saved to /var/cache/conftool/dbconfig/20221027-171749-ladsgroup.json
[17:27:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36878 and previous config saved to /var/cache/conftool/dbconfig/20221027-172712-ladsgroup.json
[17:32:31] <wikibugs>	 (PS10) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088)
[17:32:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36879 and previous config saved to /var/cache/conftool/dbconfig/20221027-173255-ladsgroup.json
[17:32:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[17:33:02] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[17:33:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[17:33:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[17:33:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[17:33:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T318950)', diff saved to https://phabricator.wikimedia.org/P36880 and previous config saved to /var/cache/conftool/dbconfig/20221027-173322-ladsgroup.json
[17:33:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:35:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318950)', diff saved to https://phabricator.wikimedia.org/P36881 and previous config saved to /var/cache/conftool/dbconfig/20221027-173532-ladsgroup.json
[17:35:38] <wikibugs>	 (PS4) Jbond: R:rsync::manifests::server::module: add type validation [puppet] - https://gerrit.wikimedia.org/r/850171
[17:35:40] <wikibugs>	 (PS3) Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - https://gerrit.wikimedia.org/r/850172
[17:35:42] <wikibugs>	 (PS1) Jbond: C:statistics::rsyncd: use the nobody user explicitly [puppet] - https://gerrit.wikimedia.org/r/850233
[17:37:39] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[17:37:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[17:37:49] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4002
[17:37:49] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4002
[17:37:55] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4006
[17:38:26] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4006
[17:38:57] <wikibugs>	 (PS2) Jbond: C:statistics::rsyncd: use the nobody user explicitly [puppet] - https://gerrit.wikimedia.org/r/850233
[17:39:15] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[17:39:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4051.ulsfo.wmnet with OS buster
[17:41:47] <wikibugs>	 (CR) BBlack: [C: -1] varnish: Conditionally set WMF-Last-Access cookie (2 comments) [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[17:42:11] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[17:42:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P36882 and previous config saved to /var/cache/conftool/dbconfig/20221027-174219-ladsgroup.json
[17:42:30] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[17:42:36] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[17:43:37] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[17:43:46] <wikibugs>	 (PS4) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[17:44:19] <wikibugs>	 (CR) BCornwall: varnish: Conditionally set WMF-Last-Access cookie (2 comments) [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[17:45:12] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:45:43] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[17:47:55] <wikibugs>	 (PS5) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[17:48:22] <wikibugs>	 SRE, SRE-swift-storage, Data-Engineering-Planning, Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (bking) Open→Resolved
[17:48:33] <wikibugs>	 SRE, SRE-swift-storage, Data-Engineering-Planning, Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (bking)
[17:50:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P36883 and previous config saved to /var/cache/conftool/dbconfig/20221027-175038-ladsgroup.json
[17:52:00] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[17:53:10] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM! This is probably functionally correct now.  We should probably validate against existing VTC tests, and possibly define a new one if" [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[17:54:58] <wikibugs>	 (CR) Xcollazo: "File rename fixes puppet issue:" [puppet] - https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: Xcollazo)
[17:57:53] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[17:59:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Cmjohnson) @Ottomata this is failing in the installer because of the raid configuration. I probably do not have it set correctly.  Can you give...
[18:00:32] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ssingh)
[18:01:19] <wikibugs>	 (CR) Jbond: [C: +2] apache: Drop ve.wikimedia.org rewrite [puppet] - https://gerrit.wikimedia.org/r/843569 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[18:02:08] <wikibugs>	 SRE: service::docker with 'latest' version behaves poorly if the host runs out of disk space - https://phabricator.wikimedia.org/T321851 (taavi)
[18:02:39] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH)
[18:02:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[18:02:48] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:02:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[18:03:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36884 and previous config saved to /var/cache/conftool/dbconfig/20221027-180301-ladsgroup.json
[18:03:07] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[18:03:46] <wikibugs>	 (PS1) Btullis: Update the spark and spark-operator images [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730)
[18:04:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36885 and previous config saved to /var/cache/conftool/dbconfig/20221027-180408-ladsgroup.json
[18:05:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P36886 and previous config saved to /var/cache/conftool/dbconfig/20221027-180545-ladsgroup.json
[18:06:38] <wikibugs>	 (CR) Btullis: Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[18:09:14] <wikibugs>	 (PS1) Ahmon Dancy: scap::master: Clone the scap repo from gitlab [puppet] - https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847)
[18:11:14] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:12:44] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:15:41] <wikibugs>	 (PS2) Ahmon Dancy: beta::autoupdater: Remove more obsolete stuff after scap prep auto [puppet] - https://gerrit.wikimedia.org/r/753787
[18:16:10] <wikibugs>	 (PS2) Ahmon Dancy: scap::master: Clone the scap repo from gitlab [puppet] - https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847)
[18:16:12] <wikibugs>	 (PS1) Ahmon Dancy: git::clone: Append .git to clone url for gitlab source [puppet] - https://gerrit.wikimedia.org/r/850249
[18:19:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P36887 and previous config saved to /var/cache/conftool/dbconfig/20221027-181915-ladsgroup.json
[18:20:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318950)', diff saved to https://phabricator.wikimedia.org/P36888 and previous config saved to /var/cache/conftool/dbconfig/20221027-182051-ladsgroup.json
[18:20:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance
[18:20:58] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[18:21:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance
[18:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T318950)', diff saved to https://phabricator.wikimedia.org/P36889 and previous config saved to /var/cache/conftool/dbconfig/20221027-182113-ladsgroup.json
[18:23:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318950)', diff saved to https://phabricator.wikimedia.org/P36890 and previous config saved to /var/cache/conftool/dbconfig/20221027-182323-ladsgroup.json
[18:24:14] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[18:24:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Ottomata) What's the error you are getting?  See https://phabricator.wikimedia.org/T314160#8166075 and below.  In codfw, sda and sdb were mapped...
[18:28:07] <wikibugs>	 (CR) Ottomata: [C: +1] C:statistics::rsyncd: use the nobody user explicitly [puppet] - https://gerrit.wikimedia.org/r/850233 (owner: Jbond)
[18:34:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P36891 and previous config saved to /var/cache/conftool/dbconfig/20221027-183421-ladsgroup.json
[18:38:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P36892 and previous config saved to /var/cache/conftool/dbconfig/20221027-183830-ladsgroup.json
[18:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[18:49:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36893 and previous config saved to /var/cache/conftool/dbconfig/20221027-184928-ladsgroup.json
[18:49:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[18:49:34] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[18:49:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[18:49:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36894 and previous config saved to /var/cache/conftool/dbconfig/20221027-184949-ladsgroup.json
[18:50:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36895 and previous config saved to /var/cache/conftool/dbconfig/20221027-185057-ladsgroup.json
[18:53:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P36896 and previous config saved to /var/cache/conftool/dbconfig/20221027-185336-ladsgroup.json
[19:00:45] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:01:56] <wikibugs>	 (CR) Jbond: [C: +2] Add Apache configuration for vewikimedia [puppet] - https://gerrit.wikimedia.org/r/843001 (https://phabricator.wikimedia.org/T320890) (owner: Zabe)
[19:06:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P36897 and previous config saved to /var/cache/conftool/dbconfig/20221027-190604-ladsgroup.json
[19:08:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318950)', diff saved to https://phabricator.wikimedia.org/P36898 and previous config saved to /var/cache/conftool/dbconfig/20221027-190843-ladsgroup.json
[19:08:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance
[19:08:50] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[19:08:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance
[19:09:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T318950)', diff saved to https://phabricator.wikimedia.org/P36899 and previous config saved to /var/cache/conftool/dbconfig/20221027-190904-ladsgroup.json
[19:11:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318950)', diff saved to https://phabricator.wikimedia.org/P36900 and previous config saved to /var/cache/conftool/dbconfig/20221027-191114-ladsgroup.json
[19:14:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (thcipriani) Approved as keeper of `deployment` group (probably fine to remove from `restricted` as it's a subset)  >>! In T321772#8348479, @mfossati wrote: > @thcipriani : not sure a...
[19:21:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P36901 and previous config saved to /var/cache/conftool/dbconfig/20221027-192110-ladsgroup.json
[19:26:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:26:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P36902 and previous config saved to /var/cache/conftool/dbconfig/20221027-192621-ladsgroup.json
[19:26:50] <wikibugs>	 (PS3) BCornwall: prometheus: Add ats header/body size total metrics [puppet] - https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304)
[19:26:52] <wikibugs>	 (CR) BCornwall: prometheus: Add ats header/body size total metrics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: BCornwall)
[19:29:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Cmjohnson) @Ottomata yes, that is what's happening here
[19:30:24] <wikibugs>	 (PS1) RobH: ganeti4006 [puppet] - https://gerrit.wikimedia.org/r/850260 (https://phabricator.wikimedia.org/T317247)
[19:31:14] <wikibugs>	 (CR) RobH: [C: +2] ganeti4006 [puppet] - https://gerrit.wikimedia.org/r/850260 (https://phabricator.wikimedia.org/T317247) (owner: RobH)
[19:31:15] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[19:36:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36903 and previous config saved to /var/cache/conftool/dbconfig/20221027-193617-ladsgroup.json
[19:36:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[19:36:24] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[19:36:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[19:36:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:36:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:36:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T318950)', diff saved to https://phabricator.wikimedia.org/P36904 and previous config saved to /var/cache/conftool/dbconfig/20221027-193656-ladsgroup.json
[19:38:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318950)', diff saved to https://phabricator.wikimedia.org/P36905 and previous config saved to /var/cache/conftool/dbconfig/20221027-193803-ladsgroup.json
[19:41:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P36906 and previous config saved to /var/cache/conftool/dbconfig/20221027-194127-ladsgroup.json
[19:41:44] <wikibugs>	 (CR) Andrew Bogott: openstack: make domain-aware (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/845004 (https://phabricator.wikimedia.org/T321349) (owner: Andrew Bogott)
[19:49:40] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[19:50:25] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[19:50:46] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[19:51:31] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[19:52:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (Ottomata) K, looks like RobH was able to [[ https://phabricator.wikimedia.org/T314160#8166665 | fix it somehow ]].
[19:53:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P36907 and previous config saved to /var/cache/conftool/dbconfig/20221027-195310-ladsgroup.json
[19:53:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Cmjohnson) @Jclark-ctr  can you look at kafka-logging1005 and make sure the network cable is connected and the right port. Sorry to bug you on this...
[19:54:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Cmjohnson) I added these to netbox but when I ran the dns script and home, nothing changed.
[19:56:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318950)', diff saved to https://phabricator.wikimedia.org/P36908 and previous config saved to /var/cache/conftool/dbconfig/20221027-195634-ladsgroup.json
[19:56:40] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[19:59:30] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[19:59:49] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:00:05] <jouncebot>	 brennen: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T2000). Please do the needful.
[20:00:05] <jouncebot>	 danisztls and koi: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[20:00:21] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:01:13] <danisztls>	 o/
[20:02:39] <koi>	 sorry, my mistake, I have nothing to deploy
[20:03:01] <kindrobot>	 Hey danisztls, we're about to deploy your patch. :)
[20:03:07] <thcipriani>	 koi: cool, thanks for clarifying
[20:03:29] <danisztls>	 new robot
[20:03:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.224 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:04:07] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:06:44] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza)
[20:07:43] <wikibugs>	 (PS5) Stef Dunlap: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza)
[20:07:56] <wikibugs>	 (CR) TrainBranchBot: "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza)
[20:08:09] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED
[20:08:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P36909 and previous config saved to /var/cache/conftool/dbconfig/20221027-200817-ladsgroup.json
[20:08:40] <wikibugs>	 (Merged) jenkins-bot: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza)
[20:08:46] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS buster
[20:08:54] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster
[20:08:54] <logmsgbot>	 !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:834048|Deploy Research Incentive survey on enwiki (T318333)]]
[20:08:59] <stashbot>	 T318333: Deploy Research Incentive Survey targeting Sub-Saharan Africa and Latin America readers - https://phabricator.wikimedia.org/T318333
[20:09:14] <logmsgbot>	 !log kindrobot@deploy1002 kindrobot and dani: Backport for [[gerrit:834048|Deploy Research Incentive survey on enwiki (T318333)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:10:05] <kindrobot>	 danisztls: your changes are ready to check on debug
[20:10:53] <danisztls>	 kindrobot: lgtm
[20:13:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[20:14:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[20:14:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[20:15:26] <logmsgbot>	 !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:834048|Deploy Research Incentive survey on enwiki (T318333)]] (duration: 06m 32s)
[20:15:32] <stashbot>	 T318333: Deploy Research Incentive Survey targeting Sub-Saharan Africa and Latin America readers - https://phabricator.wikimedia.org/T318333
[20:15:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[20:16:08] <kindrobot>	 danisztls: deployment finished
[20:16:22] <danisztls>	 kindrobot: thanks
[20:16:55] <kindrobot>	 !log End of UTC late backport deployment window
[20:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:46] <wikibugs>	 (PS1) Stang: Add main page on non-English privatewiki to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/850266 (https://phabricator.wikimedia.org/T321796)
[20:20:48] <thcipriani>	 \o/ new backport deployers
[20:21:04] <kindrobot>	 :D
[20:21:31] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:23:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:23:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318950)', diff saved to https://phabricator.wikimedia.org/P36910 and previous config saved to /var/cache/conftool/dbconfig/20221027-202323-ladsgroup.json
[20:23:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[20:23:30] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[20:23:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[20:23:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T318950)', diff saved to https://phabricator.wikimedia.org/P36911 and previous config saved to /var/cache/conftool/dbconfig/20221027-202345-ladsgroup.json
[20:24:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T318950)', diff saved to https://phabricator.wikimedia.org/P36912 and previous config saved to /var/cache/conftool/dbconfig/20221027-202452-ladsgroup.json
[20:29:04] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage
[20:32:32] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage
[20:35:36] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti4006.ulsfo.wmnet with OS buster
[20:35:43] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster executed with errors:...
[20:36:11] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS bullseye
[20:36:19] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye
[20:39:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P36913 and previous config saved to /var/cache/conftool/dbconfig/20221027-203959-ladsgroup.json
[20:42:47] <wikibugs>	 (CR) Ssingh: [C: +2] cp4052: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850193 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[20:47:00] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for commonswiki (T300770)
[20:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:06] <stashbot>	 T300770: Special:UnconnectedPages for main namespace is slow (ca. 10 seconds) - https://phabricator.wikimedia.org/T300770
[20:47:42] <hoo>	 !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for enwiki, enwiktionary (T300770)
[20:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:53] <wikibugs>	 (PS2) Ssingh: cp4052: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850193 (https://phabricator.wikimedia.org/T317244)
[20:50:54] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[20:52:12] <wikibugs>	 (PS1) Ssingh: cp4040: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850273 (https://phabricator.wikimedia.org/T317244)
[20:53:31] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage
[20:55:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P36914 and previous config saved to /var/cache/conftool/dbconfig/20221027-205505-ladsgroup.json
[20:56:07] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[20:56:08] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage
[20:56:31] <sukhe>	 !log sudo ipmitool -I lanplus -H "cp4052.mgmt.ulsfo.wmnet" -U root -E chassis power cycle
[20:56:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:32] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ganeti4006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.38. Check system logs on 10.128.0.38 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T321863 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:57:37] <wikibugs>	 SRE, ops-ulsfo: Degraded RAID on ganeti4006 - https://phabricator.wikimedia.org/T321863 (ops-monitoring-bot)
[20:58:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:59:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:00:13] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[21:02:07] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1430 days) https://wikitech.wikimedia.org/wiki/Logs
[21:02:32] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:04:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:08:35] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[21:10:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T318950)', diff saved to https://phabricator.wikimedia.org/P36915 and previous config saved to /var/cache/conftool/dbconfig/20221027-211012-ladsgroup.json
[21:10:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[21:10:18] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[21:10:27] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:10:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[21:10:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T318950)', diff saved to https://phabricator.wikimedia.org/P36916 and previous config saved to /var/cache/conftool/dbconfig/20221027-211034-ladsgroup.json
[21:11:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318950)', diff saved to https://phabricator.wikimedia.org/P36917 and previous config saved to /var/cache/conftool/dbconfig/20221027-211142-ladsgroup.json
[21:12:15] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:13:22] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4006.ulsfo.wmnet with OS bullseye
[21:13:28] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye completed: - ganeti4...
[21:13:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:14:10] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[21:16:57] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:54] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH) When trying to run the sre.hosts.provision script on cp4052, I get the following issue:    ` [1/30, retrying in 30.00s] Polling task: JID_669057087428 not co...
[21:20:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:45] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED
[21:21:09] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:26:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P36918 and previous config saved to /var/cache/conftool/dbconfig/20221027-212648-ladsgroup.json
[21:28:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:34:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[21:34:40] <wikibugs>	 SRE, ops-ulsfo, Infrastructure-Foundations: Degraded RAID on ganeti4006 - https://phabricator.wikimedia.org/T321863 (Peachey88)
[21:41:03] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[21:41:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:41:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P36919 and previous config saved to /var/cache/conftool/dbconfig/20221027-214154-ladsgroup.json
[21:45:39] <wikibugs>	 (CR) Andrew Bogott: "prod pcc run looks good:" [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro)
[21:46:28] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4052
[21:46:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4052
[21:46:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[21:56:05] <wikibugs>	 (PS1) BBlack: Add fake digicert-2022 keys [labs/private] - https://gerrit.wikimedia.org/r/850279
[21:56:14] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[21:56:41] <wikibugs>	 (PS1) Ssingh: Revert "cp4052: update site.pp and related configs for cp (upload) role" [puppet] - https://gerrit.wikimedia.org/r/850085
[21:57:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318950)', diff saved to https://phabricator.wikimedia.org/P36920 and previous config saved to /var/cache/conftool/dbconfig/20221027-215701-ladsgroup.json
[21:57:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[21:57:08] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[21:57:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[21:57:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T318950)', diff saved to https://phabricator.wikimedia.org/P36921 and previous config saved to /var/cache/conftool/dbconfig/20221027-215723-ladsgroup.json
[21:57:55] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "cp4052: update site.pp and related configs for cp (upload) role" [puppet] - https://gerrit.wikimedia.org/r/850085 (owner: Ssingh)
[21:58:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318950)', diff saved to https://phabricator.wikimedia.org/P36922 and previous config saved to /var/cache/conftool/dbconfig/20221027-215831-ladsgroup.json
[22:09:11] <wikibugs>	 (CR) BBlack: [V: +2 C: +2] Add fake digicert-2022 keys [labs/private] - https://gerrit.wikimedia.org/r/850279 (owner: BBlack)
[22:09:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:13:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P36923 and previous config saved to /var/cache/conftool/dbconfig/20221027-221337-ladsgroup.json
[22:13:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:14:37] <wikibugs>	 (PS1) BBlack: Add digicert-2022 unified cert files [puppet] - https://gerrit.wikimedia.org/r/850285 (https://phabricator.wikimedia.org/T313328)
[22:14:39] <wikibugs>	 (PS1) BBlack: Add digicert-2022 to available unified set [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328)
[22:14:41] <wikibugs>	 (PS1) BBlack: Switch drmrs, eqsin, esams to digicert-2022 [puppet] - https://gerrit.wikimedia.org/r/850287 (https://phabricator.wikimedia.org/T313328)
[22:14:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:16:37] <wikibugs>	 (CR) Andrew Bogott: "VM pcc results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37824/" [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro)
[22:16:48] <wikibugs>	 (CR) CI reject: [V: -1] Add digicert-2022 unified cert files [puppet] - https://gerrit.wikimedia.org/r/850285 (https://phabricator.wikimedia.org/T313328) (owner: BBlack)
[22:17:40] <bblack>	 ^ 22:15:07 The following are missing a SPDX licence header:
[22:17:45] <bblack>	 really?
[22:18:47] <wikibugs>	 (CR) BBlack: [V: +2 C: +2] Add digicert-2022 unified cert files [puppet] - https://gerrit.wikimedia.org/r/850285 (https://phabricator.wikimedia.org/T313328) (owner: BBlack)
[22:19:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:24:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:28:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P36924 and previous config saved to /var/cache/conftool/dbconfig/20221027-222844-ladsgroup.json
[22:28:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:37:21] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (Andrew) Open→Resolved
[22:37:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Andrew)
[22:43:21] <wikibugs>	 (PS1) Andrew Bogott: Purge the last few references to labstore100[67] [puppet] - https://gerrit.wikimedia.org/r/850296 (https://phabricator.wikimedia.org/T319217)
[22:43:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318950)', diff saved to https://phabricator.wikimedia.org/P36925 and previous config saved to /var/cache/conftool/dbconfig/20221027-224350-ladsgroup.json
[22:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[22:43:59] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[22:44:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[22:44:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36926 and previous config saved to /var/cache/conftool/dbconfig/20221027-224413-ladsgroup.json
[22:44:53] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts labstore1007
[22:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:49:39] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[22:50:38] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:51:44] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:51:45] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labstore1007
[22:53:21] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts labstore1006
[22:53:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36927 and previous config saved to /var/cache/conftool/dbconfig/20221027-225322-ladsgroup.json
[22:53:31] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[22:57:32] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[23:00:34] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:00:35] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labstore1006
[23:01:27] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Purge the last few references to labstore100[67] [puppet] - https://gerrit.wikimedia.org/r/850296 (https://phabricator.wikimedia.org/T319217) (owner: Andrew Bogott)
[23:04:37] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (Andrew) a:Andrew→Jclark-ctr
[23:04:43] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (Andrew) btw, I believe each of these servers is attached to an external disk shelf -- those shelves should also be decom'd.
[23:08:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P36928 and previous config saved to /var/cache/conftool/dbconfig/20221027-230828-ladsgroup.json
[23:09:38] <icinga-wm>	 PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:22:40] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:23:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P36929 and previous config saved to /var/cache/conftool/dbconfig/20221027-232335-ladsgroup.json
[23:38:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36930 and previous config saved to /var/cache/conftool/dbconfig/20221027-233842-ladsgroup.json
[23:38:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[23:38:49] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[23:38:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[23:39:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T318950)', diff saved to https://phabricator.wikimedia.org/P36931 and previous config saved to /var/cache/conftool/dbconfig/20221027-233903-ladsgroup.json
[23:41:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318950)', diff saved to https://phabricator.wikimedia.org/P36932 and previous config saved to /var/cache/conftool/dbconfig/20221027-234111-ladsgroup.json
[23:51:09] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:56:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P36933 and previous config saved to /var/cache/conftool/dbconfig/20221027-235618-ladsgroup.json
[23:58:38] <wikibugs>	 (PS2) Ssingh: cp4040: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850273 (https://phabricator.wikimedia.org/T317244)
[23:59:40] <wikibugs>	 (CR) Ssingh: [C: +2] cp4040: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850273 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)